Preface



SAGA 2001, the first Symposium on Stochastic Algorithms, Foundations and 
Applications, took place on December 13-14, 2001 in Berlin, Germany. The 
present volume comprises contributed papers and four invited talks that were 
included in the final program of the symposium. 

Stochastic algorithms constitute a general approach to finding approximate
solutions to a wide variety of problems. Although there is no formal proof that
stochastic algorithms perform better than deterministic ones, empirical
observations indicate that, for a broad range of applications, stochastic
algorithms produce near-optimal solutions in a reasonable run-time.

The symposium aims to provide a forum for presentation of original research 
in the design and analysis, experimental evaluation, and real-world application of 
stochastic algorithms. It focuses, in particular, on new algorithmic ideas
involving stochastic decisions and exploiting probabilistic properties of the
underlying problem domain. The program of the symposium reflects the effort to
promote cooperation among practitioners and theoreticians and among algorithmic
and complexity researchers of the field. In this context, we would like to
express our special gratitude to DaimlerChrysler AG for supporting SAGA 2001.

The contributed papers included in the proceedings present results in the
following areas: network and distributed algorithms; local search methods for
combinatorial optimization with application to constraint satisfaction problems,
manufacturing systems, motor control unit calibration, and packing flexible
objects; and computational learning theory.

The invited talk by Juraj Hromkovič surveys fundamental results about
randomized communication complexity. In the talk, the efficiency of randomized
communication is related to deterministic and nondeterministic communication
models. Martin Sauerhoff discusses randomized variants of branching programs
which allow the relative power of deterministic, nondeterministic, and
randomized algorithms to be studied. Recent results on random 3-SAT formulas are
summarized by Gregory Sorkin; the focus is on bounds for their density, and the
talk shows how to tune so-called myopic algorithms optimally. Thomas Zeugmann
gives an overview of stochastic finite learning, which connects concepts from
PAC learning and models of inductive inference.

Our special thanks go to all who supported SAGA 2001, to all authors who 
submitted papers, to all members of the program committee who provided very 
detailed referee reports, to the invited speakers, to the organizing committee, 
and to the sponsoring institutions. 
Kathleen Steinhöfel

Randomized Communication Protocols
(A Survey) 



Juraj Hromkovič*

Lehrstuhl für Informatik I, RWTH Aachen,
Ahornstraße 55, 52074 Aachen, Germany



Abstract. There are very few computing models for which the power 
of randomized computing is as well understood as for communication 
protocols and their communication complexity. Since the communication 
complexity is strongly related to several complexity measures of distinct 
basic models of computation, there exist possibilities to transform some 
results about randomized communication protocols to other computing 
models, and so communication complexity has established itself as 
a powerful instrument for the study of randomization in complexity 
theory. The aim of this work is to survey the fundamental results 
about randomized communication complexity with the focus on the 
comparison of the efficiency of deterministic, nondeterministic and 
randomized communication. 

Keywords: Randomized computing, communication complexity, two-party
protocols.



1 Introduction 

The communication complexity of two-party protocols was introduced by Yao 
[1] in 1978-79. (Note that communication complexity was implicitly considered 
by Abelson [2], too.) The initial goal was to develop a method for proving lower 
bounds on the complexity of distributed and parallel computations, with a spe- 
cial emphasis on VLSI computations [3,4,5,6,7,8,9,10,11]. 

But the study of communication complexity has contributed far more to
complexity theory than was expected in the early eighties, and today its
relation to VLSI theory is no longer among the most important applications of
communication protocols. Communication complexity theory has established itself
as a subarea of complexity theory, and the main reasons for this success are the
following:

(i) Communication complexity (like Kolmogorov complexity) became a powerful
method for proving lower bounds on fundamental complexity measures of
concrete problems (see [3,4] for some surveys). In some applications it
brought a breakthrough after long efforts to prove lower bounds (see, for
instance, [12,13,14,15,16,17]). This succeeded mainly thanks to the
development of powerful nontrivial mathematical machinery for determining
the communication complexity of concrete problems, while the relation of
communication complexity to other complexity measures is typically
transparent and usually easy to establish.

* Supported by the DFG Project Hr.

K. Steinhöfel (Ed.): SAGA 2001, LNCS 2264, pp. 1-32, 2001.

(ii) Communication complexity has contributed substantially to our
understanding of randomized computation. There are very few computing models
for which the power of randomized computing is as well understood as for
two-party communication protocols. Moreover, due to (i) one can extend the
results about randomized communication complexity to some other basic
computing models.

This survey focuses on the study of randomized communication protocols.
It is organized as follows. Section 2 presents the basic model of two-party
communication protocols and its nondeterministic and randomized extensions.
Here, we consider Las Vegas, one-sided-error and two-sided-error
(bounded-error) randomization. Section 3 gives a short overview of methods for
proving lower bounds on communication complexity. The central parts of this
paper are Sections 4 and 5. Section 4 relates the efficiency of randomized
communication to the deterministic and nondeterministic communication models.
Section 5 investigates whether one can restrict the number of random bits
without restricting the power of randomized communication protocols. For lack
of space we do not survey applications of the results presented in Sections 4
and 5 to other computing models.

2 Definition of Communication Protocols 

Informally, a two-party (communication) protocol consists of two computers
$C_I$ and $C_{II}$ and a communication link between them. A protocol computes a
finite function $f : U \times V \to Z$ in the following way. At the beginning
$C_I$ obtains an input $\alpha \in U$ and $C_{II}$ obtains an input
$\beta \in V$. Then $C_I$ and $C_{II}$ communicate according to the rules of the
protocol by exchanging binary messages until one of them knows
$f(\alpha,\beta)$. The communication complexity of the computation on an
input $(\alpha,\beta)$ is the sum of the lengths of the messages exchanged. The
communication complexity of the protocol is the maximum of the complexities
over all inputs from $U \times V$. The communication complexity of $f$,
$\mathrm{cc}(f)$, is the complexity of the best protocol for $f$.

Typically, one considers the situation that $U = \{0,1\}^n$, $V = \{0,1\}^n$ and
$Z = \{0,1\}$, i.e. $f$ is a Boolean function of $2n$ variables. Since this
restriction is sufficient for our purposes we give the formal definition for
protocols computing Boolean functions only. In what follows we describe the
computation of a protocol on a specific input by a string
$c_1\$c_2\$\ldots\$c_k\$c_{k+1}$, where $c_i \in \{0,1\}^+$ for
$i = 1,\ldots,k$ are the messages exchanged ($c_1$ is the first message sent
from $C_I$ to $C_{II}$, $c_2$ is the second message sent from $C_{II}$ to
$C_I$, etc.) and $c_{k+1}$ is the result of the computation. The part
$c_1\$c_2\$\ldots\$c_k$ is called the history of communication.
Definition 1. Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be a Boolean
function over a set $X = \{x_1, x_2, \ldots, x_{2n}\}$ of Boolean variables. A
protocol $P$ over $X$ (or over $\{0,1\}^n \times \{0,1\}^n$) is a function from
$\{0,1\}^n \times \{0,1,\$\}^*$ to $\{0,1\}^+ \cup \{accept, reject\}$ such that

(i) $P$ has the prefix-freeness property:

For each $c \in \{0,1,\$\}^*$ and any two different $\alpha, \beta \in \{0,1\}^n$,
$P(\alpha,c)$ is no proper prefix of $P(\beta,c)$.

{The next message $P(\alpha,c)$ is computed either by $C_I$ or by $C_{II}$ in
dependence on the input $\alpha$ of $C_I$ ($C_{II}$) and the history $c$ of
communication. The prefix-freeness property assures that the messages exchanged
between the two computers are self-delimiting, and no extra "end of
transmission" symbol is required. To visualize the end of messages in the
history of communication $c$, we write the special symbol \$ at the end of
every message.}

(ii) If $P(\alpha,c) \in \{accept, reject\}$ for an $\alpha \in \{0,1\}^n$ and
$c \in (\{0,1\}^+\$)^{2p}$ for some $p \in \mathbb{N}$
[for $c \in (\{0,1\}^+\$)^{2p+1}$], then, for all $q \in \mathbb{N}$,
$\gamma \in \{0,1\}^n$, and every communication history
$d \in (\{0,1\}^+\$)^{2q+1}$ [$d \in (\{0,1\}^+\$)^{2q}$],
$P(\gamma,d) \notin \{accept, reject\}$.

{This property assures that the output value is always computed by the same
computer independently of the input assignment.}

(iii) For every $c \in \{0,1,\$\}^*$, if $P(\alpha,c) \in \{accept, reject\}$
for some $\alpha \in \{0,1\}^n$, then $P(\beta,c) \in \{accept, reject\}$ for
every $\beta \in \{0,1\}^n$.

{This property assures that if the computer $C_I$ ($C_{II}$) computes the
output for an input, then the other computer $C_{II}$ ($C_I$) knows that $C_I$
($C_{II}$) knows the result, and so it does not wait for further
communication.}

A computation of $P$ on an input $(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$
is a string $c = c_1\$c_2\$\ldots\$c_k\$c_{k+1}$, where

(1) $k \geq 0$, $c_1, \ldots, c_k \in \{0,1\}^+$, $c_{k+1} \in \{accept, reject\}$, and

(2) for every integer $l$, $0 \leq l \leq k$,

(2.1) if $l$ is even, then $c_{l+1} = P(\alpha, c_1\$c_2\$\ldots\$c_l\$)$, and

{The message $c_{l+1}$ is sent by $C_I$, and $C_I$ computes $c_{l+1}$ in
dependence on its input part $\alpha$ and on the whole current communication
history $c_1\$c_2\$\ldots\$c_l\$$.}

(2.2) if $l$ is odd, then $c_{l+1} = P(\beta, c_1\$c_2\$\ldots\$c_l\$)$.

{The message $c_{l+1}$ is sent from $C_{II}$ to $C_I$, and $C_{II}$ computes
$c_{l+1}$ in dependence on its input $\beta$ and the communication history
$c_1\$c_2\$\ldots\$c_l\$$.}

For every computation $c = c_1\$c_2\$\ldots\$c_k\$c_{k+1}$,

$$Com(c) = c_1\$c_2\$\ldots\$c_k\$$$

is the communication (or communication history) of $c$.

We say that $P$ computes $f$ if, for every
$(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$, the computation of $P$ on
$(\alpha,\beta)$ is finite and ends with "accept" if and only if
$f(\alpha,\beta) = 1$. In what follows we also say that a computation is
accepting (rejecting) if it ends with accept (reject).
The length of a computation $c$ is the total length of all messages in $c$
(ignoring \$'s and the final accept/reject). The communication complexity
of the protocol $P$, $\mathrm{cc}(P)$, is the maximum of all computation
lengths over all inputs $(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$.

The communication complexity of $f$ is

$$\mathrm{cc}(f) = \min\{\mathrm{cc}(P) \mid P \text{ computes } f\}.$$



□ 



We use the notation $P(\alpha,\beta)$ for the output (an element of
$\{accept, reject\}$) of the computation of the protocol $P$ on the input
$(\alpha,\beta)$. Note that "accept" is used to denote the result "1", and
"reject" is used to denote the result "0". We use the notations "accept" and
"reject" instead of "1" and "0" in order to distinguish between the
communication bits and the results. In addition, it will be convenient to be
able to speak about accepting and rejecting computations in what follows.

To illustrate the above definition we consider the following simple examples.
Let

$$Eq_n(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n) = \bigwedge_{i=1}^{n} (x_i \equiv y_i)$$

be the Boolean function of $2n$ variables from $\{0,1\}^n \times \{0,1\}^n$ to
$\{0,1\}$ that takes the value 1 iff the first half of the input is equal to the
second half of the input. A protocol $P$ computing $Eq_n$ can simply work as
follows. $C_I$ sends its whole input $\alpha \in \{0,1\}^n$ to $C_{II}$ and
$C_{II}$ checks whether $\alpha$ is identical with its input
$\beta \in \{0,1\}^n$. Formally,

$$P(\alpha, \lambda) = \alpha, \text{ and}^1 \quad
P(\beta, \alpha\$) = \begin{cases} accept & \text{if } \alpha = \beta, \\ reject & \text{if } \alpha \neq \beta. \end{cases}$$

We observe that the above described strategy ($C_I$ sends its whole input to
$C_{II}$) works for every function and so

$$\mathrm{cc}(f) \leq n$$

for every Boolean function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ of $2n$
variables.
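The trivial protocol for $Eq_n$ can be simulated in a few lines. The following is a minimal sketch, not part of the original text: the function name `eq_protocol` and the representation of inputs as Python bit strings are assumptions made for illustration. It returns the computation string (history followed by the result) and the number of communicated bits.

```python
def eq_protocol(alpha: str, beta: str):
    """Simulate the one-message protocol for Eq_n (hypothetical helper).

    C_I sends its whole input alpha; C_II compares it with beta.
    Returns (computation string, number of bits communicated).
    """
    assert len(alpha) == len(beta), "both computers hold n-bit inputs"
    message = alpha                       # P(alpha, empty history) = alpha
    history = message + "$"               # '$' marks the end of the message
    result = "accept" if message == beta else "reject"  # decided by C_II
    return history + result, len(message)


print(eq_protocol("0110", "0110"))
print(eq_protocol("0110", "0111"))
```

The cost is always exactly $n$ bits, which matches the upper bound $\mathrm{cc}(f) \leq n$ stated above.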

Now, consider the symmetric function $S_{2n} : \{0,1\}^{2n} \to \{0,1\}$ that
takes the value 1 if and only if the number of 1's in the input is equal to the
number of 0's in the input. A simple way to compute $S_{2n}$ by a protocol
follows. $C_I$ sends the binary representation of the number of 1's in its input
$\alpha \in \{0,1\}^n$. $C_{II}$ checks whether $\#_1(\alpha) + \#_1(\beta) = n$,
where $\beta \in \{0,1\}^n$ is the input of $C_{II}$. Obviously, the
communication complexity of this protocol is $\lceil \log_2(n+1) \rceil$.
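The counting protocol for $S_{2n}$ can be sketched in the same way. This is an illustrative simulation under the assumption that inputs are given as Python bit strings; the name `s2n_protocol` is hypothetical. The message width is computed with `int.bit_length`, which equals $\lceil \log_2(n+1) \rceil$ for every $n \geq 1$.

```python
def s2n_protocol(alpha: str, beta: str):
    """Simulate the one-message protocol for S_2n (hypothetical helper).

    C_I sends the number of 1's in alpha in binary; C_II adds its own
    count and accepts iff the total equals n.
    """
    n = len(alpha)
    assert len(beta) == n
    width = n.bit_length()                     # equals ceil(log2(n + 1))
    message = format(alpha.count("1"), f"0{width}b")
    ones_of_alpha = int(message, 2)            # C_II decodes the message
    accept = ones_of_alpha + beta.count("1") == n
    return ("accept" if accept else "reject"), len(message)


print(s2n_protocol("1100", "0011"))
print(s2n_protocol("1110", "0011"))
```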

Note that we shall later show that the above protocols for $Eq_n$ and $S_{2n}$
are optimal. We observe that these protocols are very simple because the whole
communication consists of the submission of one message. The protocols whose
computations contain at most one message are called one-way protocols in
what follows. For every Boolean function $f$,

$$\mathrm{cc}_1(f) = \min\{\mathrm{cc}(P) \mid P \text{ is a one-way protocol computing } f\}$$

is the one-way communication complexity of $f$.

^1 $\lambda$ denotes the empty word.

In Section 3 we shall see that there can be an exponential gap between
$\mathrm{cc}_1(f)$ and $\mathrm{cc}(f)$. The following function
$f_{ind(n)}(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)$ for $n = 2^k$,
$k \in \mathbb{N} - \{0\}$, is an example of a computing problem where one can
profit from the exchange of more than one message between $C_I$ and $C_{II}$.
Let, for every binary string $\alpha = \alpha_0\alpha_1 \ldots \alpha_k$,

$$Number(\alpha) = \sum_{i=0}^{k} \alpha_{k-i} \cdot 2^i$$

be the number with the binary representation $\alpha$. The function
$f_{ind(n)}$ takes the value 1 iff

$$x_{Number(y_a\, y_{(a+1) \bmod n} \ldots y_{(a+\lceil \log_2 n \rceil - 1) \bmod n}) + 1} = 1,
\quad \text{where } a = Number(x_1 x_2 \ldots x_{\lceil \log_2 n \rceil}) + 1.$$

Informally, the first $\log_2 n$ values of the variables
$x_1, x_2, \ldots, x_{\log_2 n}$ determine a position (an index)
$a = Number(x_1, \ldots, x_{\log_2 n}) + 1$. $y_a$ and the following
$\log_2 n - 1$ values of $y$ variables determine an index $b$, and one requires
$x_b = 1$ for the result 1. We describe a protocol $P$ that computes
$f_{ind(n)}$. $C_I$ sends $x_1 x_2 \ldots x_{\lceil \log_2 n \rceil}$ to
$C_{II}$. After that $C_{II}$ sends the message
$y_a\, y_{(a+1) \bmod n} \ldots y_{(a+\lceil \log_2 n \rceil - 1) \bmod n}$ to
$C_I$, where $a = Number(x_1 \ldots x_{\lceil \log_2 n \rceil}) + 1$. Now, $C_I$
accepts iff

$$x_{Number(y_a\, y_{(a+1) \bmod n} \ldots y_{(a+\lceil \log_2 n \rceil - 1) \bmod n}) + 1} = 1.$$

The communication complexity of this protocol is $2 \cdot \lceil \log_2 n \rceil$.

In what follows we use, for all nonnegative integers $l, k$ with
$l \geq \lceil \log_2 k \rceil$, $l \geq 1$, $BIN_l(k)$ to denote the binary
representation of $k$ by a binary string of length $l$. This means that if
$l > \lceil \log_2 k \rceil$, then $l - \lceil \log_2 k \rceil$ leading 0's are
added to the representation.
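A fixed-length encoding of this kind is easy to state in code. The sketch below is an illustration only; the name `bin_l` is hypothetical, and the explicit check that $k$ fits into $l$ bits ($k < 2^l$) is an added assumption that the text leaves implicit.

```python
def bin_l(k: int, l: int) -> str:
    """Binary representation of k padded with leading 0's to length l.

    Hypothetical helper mirroring BIN_l(k); raises if k needs more
    than l bits (an assumption made explicit for this sketch).
    """
    if l < 1 or k < 0 or k >= 2 ** l:
        raise ValueError("k must satisfy 0 <= k < 2**l with l >= 1")
    return format(k, "b").zfill(l)


print(bin_l(5, 3))   # no padding needed
print(bin_l(5, 6))   # three leading 0's are added
```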

One can introduce nondeterminism for protocols in the usual way. Because 
of this we prefer to give an informal description of nondeterministic protocols 
rather than an exact formal definition. 

Let $f : U \times V \to \{0,1\}$ be a finite function. A nondeterministic
protocol $P$ computing on inputs from $U \times V$ consists of two
nondeterministic computers $C_I$ and $C_{II}$. At the beginning $C_I$ obtains an
input $\alpha \in U$, and $C_{II}$ obtains an input $\beta \in V$. As in the
deterministic case, the computation consists of a number of communication
rounds, where in one round one computer sends a message to the other one. The
computation finishes when one of the computers decides to accept or to reject
the input $(\alpha,\beta)$. In contrast to the deterministic case, $C_I$ can be
viewed as a relation on

$$(U \times \{0,1,\$\}^*) \times (\{0,1\}^+ \cup \{accept, reject\})$$

and $C_{II}$ can be viewed as a relation on

$$(V \times \{0,1,\$\}^*) \times (\{0,1\}^+ \cup \{accept, reject\}).$$

This means that, for every argument $(\alpha, c)$, $C_I$ ($C_{II}$)
nondeterministically chooses a message from a finite set of possible messages
determined by the argument $(\alpha, c)$.

We say that $P$ computes $f$ if for every $(\alpha,\beta) \in U \times V$,

(i) if $f(\alpha,\beta) = 1$, then there exists an accepting computation of $P$
on the input $(\alpha,\beta)$, and

(ii) if $f(\alpha,\beta) = 0$, then all computations of $P$ on $(\alpha,\beta)$
are rejecting ones.

Again, we require that the prefix-freeness property and the property that
exactly one computer takes the final decision for all inputs are satisfied.

The nondeterministic communication complexity of $P$, denoted by
$\mathrm{ncc}(P)$, is the maximum of the lengths of all accepting computations
of $P$. The nondeterministic communication complexity of $f$ is

$$\mathrm{ncc}(f) = \min\{\mathrm{ncc}(P) \mid P \text{ is a nondeterministic protocol computing } f\}.$$

To show the power of nondeterminism in communication consider, for every
positive integer $n$, the function
$Ineq_n : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ defined by

$$Ineq_n(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n) = \bigvee_{i=1}^{n} (x_i \oplus y_i),$$

where $\oplus$ denotes the exclusive or (plus mod 2).

A nondeterministic protocol $P$ that accepts all inputs
$(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$ with $\alpha \neq \beta$ can
work as follows. For every input $\alpha = \alpha_1\alpha_2 \ldots \alpha_n$,
$C_I$ nondeterministically chooses a number $i \in \{1, \ldots, n\}$ and sends
the message $\alpha_i BIN_{\lceil \log_2 n \rceil}(i)$ to $C_{II}$, where
$BIN_{\lceil \log_2 n \rceil}(i)$ is the binary representation of $i$ of the
length^2 $\lceil \log_2 n \rceil$. Now, for every input
$\beta = \beta_1 \ldots \beta_n$ of $C_{II}$, $C_{II}$ accepts if and only if
$\alpha_i \neq \beta_i$. In the case $\alpha_i = \beta_i$, $C_{II}$ rejects the
input. Obviously, for all $\alpha, \beta \in \{0,1\}^n$ with
$\alpha \neq \beta$, there exists a $j$ such that $\alpha_j \neq \beta_j$ and so
there exists an accepting computation of $P$ on $(\alpha,\beta)$. On the other
hand if $\alpha = \beta$, then all $n$ different computations of $P$ on
$(\alpha,\beta)$ are rejecting. So, $P$ computes $Ineq_n$ within the
communication complexity $\lceil \log_2 n \rceil + 1$, i.e.,

$$\mathrm{ncc}(Ineq_n) \leq \lceil \log_2 n \rceil + 1.$$
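Nondeterminism can be simulated by enumerating every possible guess. The sketch below is an illustration under stated assumptions: the function names are hypothetical, inputs are Python bit strings, and the chosen index $i$ is transmitted as the binary encoding of $i - 1$ so that it always fits into $\lceil \log_2 n \rceil$ bits (an encoding choice made for this sketch, not taken from the text).

```python
def ineq_computations(alpha: str, beta: str):
    """List all n computations of the nondeterministic protocol for Ineq_n.

    Each computation corresponds to one guess i of C_I. The index is sent
    as the binary encoding of i - 1 (an assumption of this sketch).
    """
    n = len(alpha)
    assert len(beta) == n
    width = max(1, (n - 1).bit_length())       # ceil(log2 n), at least 1 bit
    computations = []
    for i in range(1, n + 1):                  # every nondeterministic guess
        message = alpha[i - 1] + format(i - 1, f"0{width}b")
        bit, index = message[0], int(message[1:], 2) + 1   # decoded by C_II
        result = "accept" if bit != beta[index - 1] else "reject"
        computations.append((message, result))
    return computations


def nondeterministic_accepts(alpha: str, beta: str) -> bool:
    """P accepts iff at least one computation is accepting."""
    return any(result == "accept" for _, result in ineq_computations(alpha, beta))


print(nondeterministic_accepts("0101", "0111"))
print(nondeterministic_accepts("0101", "0101"))
```

Every message has length $\lceil \log_2 n \rceil + 1$, in line with the bound on $\mathrm{ncc}(Ineq_n)$ above.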

As in the deterministic case, one can consider one-way nondeterministic
protocols whose computations contain at most one message. Let
$\mathrm{ncc}_1(f)$ denote the one-way nondeterministic communication
complexity of $f$. In contrast to the deterministic case, we show that there is
no difference in the power of nondeterministic protocols and one-way
nondeterministic protocols.

^2 This means that additional 0's are taken to achieve the required length, if
necessary.
Theorem 1. For every finite function $f$,

$$\mathrm{ncc}(f) = \mathrm{ncc}_1(f).$$

Proof. Let $f$ be a function from $U \times V$ to $\{0,1\}$, and let
$D = (C_I, C_{II})$ be a nondeterministic protocol computing $f$. We construct a
one-way nondeterministic protocol $D_1$ that computes $f$ within the
communication complexity $\mathrm{ncc}(D)$. In what follows we say that a
computation $C = c_1\$c_2\$\ldots\$c_k\$c_{k+1}$ is consistent for $C_I$ with
an input $\alpha \in U$ (or from the point of view of $C_I$ and $\alpha$) if
$C_I$ with the input $\alpha$ has the possibility to send the messages
$c_1, c_3, c_5, \ldots$ when receiving the messages $c_2, c_4, c_6, \ldots$ from
$C_{II}$. Similarly, one can define the consistency of $C$ according to $C_{II}$
with an input $\beta \in V$. Obviously, if $C$ is a consistent computation for
$C_I$ on an input $\alpha$ and for $C_{II}$ on an input $\beta$, then $C$ is a
possible computation of $(C_I, C_{II})$ on the input $(\alpha,\beta)$.

Let $C_1, C_2, \ldots, C_k$ be all accepting computations of $D$ over all inputs
from $U \times V$. Clearly, $k \leq 2^{\mathrm{ncc}(D)}$. Let $D_1$ consist of
the computers $C_I^1$ and $C_{II}^1$. For every $\alpha \in U$, $C_I^1$
nondeterministically chooses an $i \in \{1, \ldots, k\}$, and it sends the
message $BIN_{\mathrm{ncc}(D)}(i)$ to $C_{II}^1$ if $C_i$ is a possible
accepting computation of $D$ from the point of view of $C_I$ and $\alpha$. In
other words, $C_I^1$ can send the binary representation of $i$ if and only if
there exists a $\gamma \in V$ such that $C_i$ is the accepting computation of
$D$ on $(\alpha,\gamma)$. For every input $\beta$ of $C_{II}^1$, when $C_{II}^1$
receives the binary representation of a number $i$, then $C_{II}^1$ accepts if
$C_i$ is a consistent accepting computation from the point of view of $C_{II}$
and $\beta$.

So, $D_1$ has exactly the same number of accepting computations as $D$ and
$\mathrm{ncc}_1(D_1) = \mathrm{ncc}(D)$. □

There are two distinct ways to introduce randomized protocols. One possibility
is to take a nondeterministic protocol and to consider a probability
distribution for every possible nondeterministic guess. Such randomized
protocols are called private because each of the computers takes its random
bits from a separate source; i.e., $C_I$ ($C_{II}$) does not know the random
bits of $C_{II}$ ($C_I$). If one of the computers wants to know the random bits
influencing the choice of the message submitted by the other computer, then
these bits must be communicated. Another possibility to define randomized
protocols is to say that a randomized protocol is a probability distribution
over a set of deterministic protocols. Such randomized protocols are called
public randomized protocols because this model corresponds to the situation
when both computers have the same source of random bits (i.e. everybody sees
the random bits of the other one for free). This second approach represents the
well-known paradigm "foiling an adversary" of the design of randomized
algorithms. So, every efficient public randomized protocol can be viewed as a
successful application of this paradigm.

Clearly, public randomized protocols are at least as powerful as private ones.
Newman [18] has proved that the communication complexities of public and
private randomized protocols are linearly related for every bounded-error
model. We shall present this result in Section 6. Because of this, and to
simplify matters, we formally define only the public versions of randomized
protocols.
In the following definitions we also use the notation $P(\alpha,\beta) = 1$ (0)
instead of $P(\alpha,\beta) = accept$ (reject).

Definition 2. Let $U$ and $V$ be finite sets. A randomized protocol over
$U \times V$ is a pair $R = (Prob, S)$, where

(i) $S = \{D_1, D_2, \ldots, D_k\}$ is a set of (deterministic) protocols over
$U \times V$, and

(ii) $Prob$ is a probability distribution over the elements of $S$.

For $i = 1, 2, \ldots, k$, $Prob(D_i)$ is the probability that the protocol
$D_i$ is randomly chosen to work on a given input.

For an input $(\alpha,\beta) \in U \times V$, the probability that $R$ computes
an output $z$ is

$$Prob(R(\alpha,\beta) = z) = \sum_{\substack{D \in \{D_1, \ldots, D_k\} \\ D(\alpha,\beta) = z}} Prob(D).$$

The communication complexity of $R$ is

$$\mathrm{cc}(R) = \max\{\mathrm{cc}(D) \mid D \in S\}.$$

A randomized protocol $R = (Prob, S)$ is called one-way if all elements of $S$
are one-way protocols.

For every randomized protocol $R = (Prob, \{D_1, D_2, \ldots, D_k\})$ with a
uniform probability distribution $Prob$, the degree of randomness of $R$ is
$\lceil \log_2 k \rceil$. Since one can unambiguously identify every $D_i$ with
a binary string of length $\lceil \log_2 k \rceil$, we call
$\lceil \log_2 k \rceil$ the number of random bits of $R$, too.
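The definition of a public randomized protocol as a distribution over deterministic protocols translates directly into code. The following is a minimal sketch under assumptions: deterministic protocols are modelled as plain Python functions returning an output, the two toy protocols `D1` and `D2` are invented for illustration, and probabilities are exact fractions.

```python
from fractions import Fraction


def output_probability(prob, protocols, alpha, beta, z):
    """Prob(R(alpha, beta) = z): total weight of the protocols that output z."""
    return sum(
        (prob[name] for name, d in protocols.items() if d(alpha, beta) == z),
        Fraction(0),
    )


def random_bits(k: int) -> int:
    """ceil(log2 k): bits needed to name one of k protocols (uniform case)."""
    return (k - 1).bit_length()


# Two hypothetical deterministic protocols: each compares a single position.
protocols = {
    "D1": lambda a, b: int(a[0] == b[0]),
    "D2": lambda a, b: int(a[-1] == b[-1]),
}
prob = {"D1": Fraction(1, 2), "D2": Fraction(1, 2)}

print(output_probability(prob, protocols, "10", "11", 1))
print(random_bits(len(protocols)))
```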

In what follows we consider only randomized protocols computing Boolean 
functions. In contrast to the previous protocol models we also allow a “neutral” 
output “?” , whose meaning is that the protocol was not able to compute any final 
answer in the given computation (random attempt). We consider this possibility 
for Las Vegas protocols that never err but may produce the answer “?” with a 
bounded probability. 

Note that in what follows we also use the notation $R(\alpha,\beta) = 1$
($R(\alpha,\beta) = 0$) instead of $R(\alpha,\beta) =$ "accept"
($R(\alpha,\beta) =$ "reject"). This is convenient because it allows us to speak
about the probability of the event $R(\alpha,\beta) = f(\alpha,\beta)$.

Definition 3. Let $f : U \times V \to \{0,1\}$ be a finite function. We say
that a randomized protocol $R = (Prob, S)$ is a (public) Las Vegas protocol
for $f$ if

(i) for every $(\alpha,\beta) \in U \times V$ with $f(\alpha,\beta) = 1$,

$Prob(R(\alpha,\beta) = 1) \geq \frac{1}{2}$ and $Prob(R(\alpha,\beta) = 0) = 0$, and

(ii) for every $(\alpha,\beta) \in U \times V$ with $f(\alpha,\beta) = 0$,

$Prob(R(\alpha,\beta) = 0) \geq \frac{1}{2}$ and $Prob(R(\alpha,\beta) = 1) = 0$.

The Las Vegas communication complexity of $f$ is

$$\mathrm{lvcc}(f) = \min\{\mathrm{cc}(R) \mid R \text{ is a Las Vegas protocol for } f\}.$$

The one-way Las Vegas communication complexity of $f$ is

$$\mathrm{lvcc}_1(f) = \min\{\mathrm{cc}(R) \mid R \text{ is a one-way Las Vegas protocol for } f\}.$$
We present a simple example of a one-way Las Vegas protocol. More involved
ideas for the design of Las Vegas protocols can be found in Section 4. Consider
the function $Index_n : \{0,1\}^n \times \{1, 2, \ldots, n\} \to \{0,1\}$
defined as follows^3

$$Index_n((x_1, x_2, \ldots, x_n), j) = x_j.$$

A Las Vegas one-way protocol $D$ for $Index_n$ can be described as the pair
$(Prob, \{D_1, D_2\})$, where

(i) $Prob(D_1) = Prob(D_2) = \frac{1}{2}$,

(ii) $D_1 = (D_{I,1}, D_{II,1})$, where $D_{I,1}$ sends the first half
$\alpha_1 \ldots \alpha_{\lceil n/2 \rceil}$ of its input to $D_{II,1}$.
$D_{II,1}$ outputs $\alpha_j$ if the input $j$ of $D_{II,1}$ belongs to
$\{1, 2, \ldots, \lceil n/2 \rceil\}$. If $j > \lceil n/2 \rceil$, then
$D_{II,1}$ outputs "?",

(iii) $D_2 = (D_{I,2}, D_{II,2})$, where $D_{I,2}$ sends the second half
$\alpha_{\lceil n/2 \rceil + 1} \ldots \alpha_n$ of its input to $D_{II,2}$.
$D_{II,2}$ outputs $\alpha_j$ if the input $j$ of $D_{II,2}$ belongs to
$\{\lceil n/2 \rceil + 1, \ldots, n\}$. If $j \leq \lceil n/2 \rceil$, then
$D_{II,2}$ outputs "?".

Another possibility to describe $D$ is as follows.

Las Vegas one-way protocol $D = (D_I, D_{II})$ for $Index_n$.

Input: $(\alpha, j)$, $\alpha = \alpha_1 \ldots \alpha_n \in \{0,1\}^n$,
$j \in \{1, \ldots, n\}$.

{$D_I$ gets the input $\alpha$, and $D_{II}$ gets the input $j$.}

Step 1: $D_I$ chooses a random bit $r \in \{0,1\}$.

{Note that $D_{II}$ knows $r$, too.}

If $r = 0$, then $D_I$ sends the message
$\alpha_1\alpha_2 \ldots \alpha_{\lceil n/2 \rceil}$.

If $r = 1$, then $D_I$ sends the message
$\alpha_{\lceil n/2 \rceil + 1}\alpha_{\lceil n/2 \rceil + 2} \ldots \alpha_n$.

Step 2: If $r = 0$ and $j \in \{1, 2, \ldots, \lceil n/2 \rceil\}$, then $D_{II}$
outputs $\alpha_j$.

If $r = 1$ and $j > \lceil n/2 \rceil$, then $D_{II}$ outputs $\alpha_j$.

Else, $D_{II}$ outputs "?".

In what follows we shall prefer the second form of the description of
randomized protocols. Clearly, $D$ never errs, and the probability of giving
the output "?" is $\frac{1}{2}$ for every input
$(\alpha, j) \in \{0,1\}^n \times \{1, \ldots, n\}$. The communication
complexity of $D$ is $\lceil n/2 \rceil$.
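The two steps of the Las Vegas protocol for $Index_n$ can be simulated as follows. This is a hedged sketch rather than a reference implementation: the function name `las_vegas_index`, the bit-string input format, and the optional parameter `r` (which lets a caller fix the public random bit instead of drawing it) are assumptions made for illustration.

```python
import random


def las_vegas_index(alpha: str, j: int, r=None):
    """Simulate the one-way Las Vegas protocol for Index_n.

    alpha is D_I's input, j (1-based) is D_II's input, r is the public
    random bit (drawn at random when not supplied).
    Returns (output, number of bits communicated).
    """
    n = len(alpha)
    half = (n + 1) // 2                       # ceil(n / 2)
    if r is None:
        r = random.randint(0, 1)              # Step 1: public random bit
    message = alpha[:half] if r == 0 else alpha[half:]
    if r == 0 and j <= half:                  # Step 2: j lies in the sent half
        return message[j - 1], len(message)
    if r == 1 and j > half:
        return message[j - half - 1], len(message)
    return "?", len(message)                  # the other half was sent


print(las_vegas_index("10110", 2, r=0))
print(las_vegas_index("10110", 2, r=1))
```

For every input exactly one of the two values of `r` yields the correct bit and the other yields "?", which is the sense in which the protocol never errs and fails with probability $\frac{1}{2}$.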

Note that the constant $\frac{1}{2}$ bounding the probability of the output "?"
in the definition of (two-way) Las Vegas protocols is not essential from the
asymptotic point of view. Instead of giving the output "?" a Las Vegas protocol
may start a new communication from the beginning with new random bits. If it
outputs "?" only if it reaches "?" in $k$ independent computation attempts,
then the probability to obtain the output "?" decreases from $\frac{1}{2}$ to
$2^{-k}$, but the communication complexity increases only by a factor of $k$ in
comparison with the original protocol.
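The repetition argument can be written as a small wrapper around any Las Vegas attempt. The sketch below assumes an attempt is a zero-argument callable returning a pair (answer, bits used); both the names `amplify` and `failure_bound` and this calling convention are hypothetical.

```python
from fractions import Fraction


def amplify(attempt, k: int):
    """Run up to k independent attempts; stop at the first answer other than '?'.

    attempt: zero-argument callable returning (answer, bits_communicated).
    Returns (answer, total bits communicated over all attempts made).
    """
    total_bits = 0
    for _ in range(k):
        answer, bits = attempt()
        total_bits += bits
        if answer != "?":
            return answer, total_bits
    return "?", total_bits


def failure_bound(k: int) -> Fraction:
    """Upper bound 2^(-k) on the probability that all k attempts output '?'."""
    return Fraction(1, 2) ** k


print(failure_bound(10))
```

The total cost is at most $k$ times the cost of a single attempt, since the loop performs at most $k$ runs of the original protocol.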

^3 Observe that $Index_n$ can also be viewed as a Boolean function if one
represents the numbers $1, 2, \ldots, n$ by binary strings.

Definition 4. Let $f : U \times V \to \{0,1\}$ be a finite function. We say
that a randomized protocol $R = (Prob, S)$ is a (public) one-sided-error Monte
Carlo protocol for $f$ if

(i) for every $(\alpha,\beta) \in U \times V$ with $f(\alpha,\beta) = 1$,

$Prob(R(\alpha,\beta) = 1) \geq \frac{1}{2}$, and

(ii) for every $(\alpha,\beta) \in U \times V$ with $f(\alpha,\beta) = 0$,

$Prob(R(\alpha,\beta) = 0) = 1$.

We say that a randomized protocol $R$ is a (public) two-sided-error
Monte Carlo protocol for $f$ if, for every $(\alpha,\beta) \in U \times V$,

$$Prob(R(\alpha,\beta) = f(\alpha,\beta)) \geq \frac{2}{3}.$$

The one-sided-error Monte Carlo communication complexity of $f$ is

$$\mathrm{1mccc}(f) = \min\{\mathrm{cc}(R) \mid R \text{ is a one-sided-error Monte Carlo protocol for } f\}.$$

The two-sided-error Monte Carlo communication complexity of $f$ is

$$\mathrm{2mccc}(f) = \min\{\mathrm{cc}(R) \mid R \text{ is a two-sided-error Monte Carlo protocol for } f\}.$$

Because of the condition (ii) of one-sided-error Monte Carlo protocols it is
clear that private one-sided-error Monte Carlo protocols are a restricted
version of nondeterministic ones. For the public randomized protocols defined
here we obtain

$$\mathrm{ncc}(f) \leq \mathrm{1mccc}(f) + \text{the number of random bits}$$

for every finite function $f$.

As in the case of Las Vegas, the constant $\frac{1}{2}$ in the inequality
$Prob(R(\alpha,\beta) = 1) \geq \frac{1}{2}$ is not essential for
one-sided-error Monte Carlo from the asymptotic point of view, and so
one-sided-error Monte Carlo protocols can be viewed as a restricted version of
two-sided-error Monte Carlo protocols, too.

Example 1. (based on [19]) The idea of the randomized protocol is based on the
abundance of witnesses method for the design of randomized algorithms.
Let $f$ be a Boolean function. A witness for $f(\gamma) = a$ is any binary
string $\delta$ such that, using $\delta$, there is an efficient way to prove
(verify) that $f(\gamma) = a$. For instance, any factor (nontrivial divisor)
$y$ of a number $x$ is a witness of the claim "$x$ is composite". Obviously,
checking whether $x \bmod y = 0$ is much easier than proving that $x$ is
composite without any additional information. In general, one considers
witnesses only if they essentially decrease the complexity of computing the
result. For many functions the difficulty with finding a witness
deterministically is that the witness lies in a search space that is too large
to be searched exhaustively. However, by establishing that the space contains a
large number of witnesses, it often suffices to choose an element at random
from the space. The randomly chosen item is likely to be a witness. If this
probability is not high enough, an independent random choice of several items
reduces the probability that no witness is found.

The framework of this approach is very simple. For every input $\gamma$ one has
a set $CandW(\gamma)$ that contains all candidates for a witness for the input
$\gamma$. Often $CandW(\gamma)$ is the same set for all inputs of the same size
as $\gamma$. Let $Witness(\gamma)$ contain all witnesses for $\gamma$ that are
in $CandW(\gamma)$. The aim is to reach a situation where the cardinality of
$Witness(\gamma)$ is proportional to the cardinality of $CandW(\gamma)$.

To design a randomized protocol for $Ineq_n$, we say that a prime $p$ is a
witness for $\alpha \neq \beta$, $\alpha, \beta \in \{0,1\}^n$, if

$$Number(\alpha) \bmod p \neq Number(\beta) \bmod p.$$

For every input $(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$,
$CandW(\alpha,\beta)$ is the set of all primes from $\{2, 3, \ldots, n^2\}$. Due
to the Prime Number Theorem we know that $|CandW(\alpha,\beta)|$ is
approximately $n^2 / \ln n^2$. Now, we estimate the lower bound on
$|Witness(\alpha,\beta)|$. Let $\alpha \neq \beta$. If, for a prime $p$,

$$Number(\alpha) \bmod p = Number(\beta) \bmod p, \qquad (1)$$

then $p$ divides $h = Number(\alpha) - Number(\beta)$. Since $h < 2^n$, $h$ has
fewer than $n$ different prime divisors.^4 This means that at most $n - 1$
primes from $CandW(\alpha,\beta)$ have the property (1). So,

$$|Witness(\alpha,\beta)| \geq |CandW(\alpha,\beta)| - n + 1. \qquad (2)$$

Now, we use (2) to design our randomized protocol.

One-sided-error Monte Carlo protocol $R = (R_I, R_{II})$ for $Ineq_n$.

Input: $(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$

Step 1: $R_I$ chooses uniformly a prime $p \in \{2, 3, \ldots, n^2\}$ at random.

{Note that $R_{II}$ knows this choice of $R_I$.}

Step 2: $R_I$ computes $s = Number(\alpha) \bmod p$ and sends the binary
representation of $s$ to $R_{II}$.

{Note that the length of the message is
$\lceil \log_2 n^2 \rceil \leq 2 \cdot \lceil \log_2 n \rceil$.}

Step 3: $R_{II}$ computes $q = Number(\beta) \bmod p$.

If $q \neq s$, then $R_{II}$ outputs 1 ("accept").

If $q = s$, then $R_{II}$ outputs 0 ("reject").

We show that $R$ is a one-sided-error Monte Carlo protocol for $Ineq_n$. If
$\alpha = \beta$ for an input $(\alpha,\beta) \in \{0,1\}^n \times \{0,1\}^n$,
then $Number(\alpha) \bmod p = Number(\beta) \bmod p$ for every prime $p$. So,

$$Prob(R(\alpha,\beta) = \text{"reject"}) = 1.$$

Let $\alpha \neq \beta$, i.e. $Ineq_n(\alpha,\beta) = 1$. Due to the
inequality (2), the probability that $R$ chooses a prime with the property (1)
is at most

$$\frac{|CandW(\alpha,\beta)| - |Witness(\alpha,\beta)|}{|CandW(\alpha,\beta)|} \leq \frac{n-1}{|CandW(\alpha,\beta)|}.$$

^4 Observe that $n! > 2^n$.

Since $|CandW(\alpha,\beta)| \geq n^2 / (2 \ln n^2)$ already for small $n$'s,
the probability that $R$ rejects $(\alpha,\beta)$ is at most

$$\frac{n-1}{|CandW(\alpha,\beta)|} \leq \frac{n-1}{n^2 / (2 \ln n^2)} \leq \frac{2 \ln n^2}{n}.$$

Thus,

$$Prob(R(\alpha,\beta) = \text{"accept"}) \geq 1 - \frac{2 \ln n^2}{n},$$

which even tends to 1 for sufficiently large $n$'s. So, we have proved

$$\mathrm{1mccc}(Ineq_n) = O(\log_2 n).$$
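The fingerprinting protocol and the counting argument behind inequality (2) can both be checked on small inputs. The sketch below is illustrative only: the function names are hypothetical, inputs are Python bit strings interpreted via `int(..., 2)` as $Number(\cdot)$, and a simple sieve stands in for the choice of a prime from $\{2, \ldots, n^2\}$ (it assumes $n \geq 2$).

```python
import random


def primes_up_to(m: int):
    """Sieve of Eratosthenes; assumes m >= 2."""
    sieve = [False, False] + [True] * (m - 1)
    for p in range(2, int(m ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, m + 1, p):
                sieve[multiple] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]


def ineq_monte_carlo(alpha: str, beta: str, rng=random) -> int:
    """One run of the one-sided-error protocol for Ineq_n (1 = accept)."""
    n = len(alpha)
    p = rng.choice(primes_up_to(n * n))   # Step 1: public random prime
    s = int(alpha, 2) % p                 # Step 2: fingerprint sent by R_I
    q = int(beta, 2) % p                  # Step 3: fingerprint of R_II
    return 1 if q != s else 0


def non_witness_primes(alpha: str, beta: str):
    """Primes in CandW with property (1), i.e. those that make R reject."""
    n = len(alpha)
    return [p for p in primes_up_to(n * n)
            if int(alpha, 2) % p == int(beta, 2) % p]


print(non_witness_primes("10110100", "00110100"))
print(ineq_monte_carlo("0110", "0110"))
```

Equal inputs are rejected on every run, while for different inputs the number of primes that lead to rejection never exceeds $n - 1$, as the argument for (2) predicts.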



3 Lower Bounds on Communication Complexity 

In this section we do not deal with randomized protocols. The aim here is to
present methods for proving lower bounds on the communication complexity of
deterministic and nondeterministic protocols. These methods will be used in the
subsequent sections to show the power of randomization in comparison with
deterministic and nondeterministic communication.

The most transparent way to explain the methods for proving lower bounds 
on communication complexity is to consider the representation of functions to 
be computed in the form of so-called communication matrices. 

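Definitions. Let $U = \{\alpha_1, \dots, \alpha_k\}$, $V = \{\beta_1, \dots, \beta_m\}$ be two sets, and let $f : U \times V \to \{0,1\}$. The communication matrix of $f$ is the $|U| \times |V|$ Boolean matrix $M_f = [a_{ij}]_{i=1,\dots,k,\; j=1,\dots,m}$ defined by

$$a_{ij} = a_{\alpha_i \beta_j} = f(\alpha_i, \beta_j).$$

For every $\gamma \in U$, the row corresponding to the input part $\gamma$ is called the row of $\gamma$ and it is denoted by $\mathrm{row}_\gamma = (a_{\gamma\beta_1}, a_{\gamma\beta_2}, \dots, a_{\gamma\beta_m})$. Similarly, the column corresponding to an input part $\delta \in V$ is called the column of $\delta$, and it is denoted by $\mathrm{column}_\delta = (a_{\alpha_1\delta}, \dots, a_{\alpha_k\delta})$.

For all non-empty index sets $\{i_1, \dots, i_l\} \subseteq \{1, \dots, k\}$, $\{j_1, \dots, j_r\} \subseteq \{1, \dots, m\}$,

$$M_f(\{\alpha_{i_1}, \alpha_{i_2}, \dots, \alpha_{i_l}\}, \{\beta_{j_1}, \beta_{j_2}, \dots, \beta_{j_r}\})$$

denotes the $l \times r$ submatrix of $M_f$ obtained by the intersection of the rows of $M_f$ that correspond to the inputs $\alpha_{i_1}, \alpha_{i_2}, \dots, \alpha_{i_l}$ and the columns corresponding to the inputs $\beta_{j_1}, \beta_{j_2}, \dots, \beta_{j_r}$.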
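The definitions above translate directly into code. The following Python sketch is an illustration only: the helper names `words`, `comm_matrix` and `submatrix` are hypothetical, inputs are bit strings in lexicographic order, and the equality function on 2-bit words is used as an assumed example.

```python
from itertools import product

def words(n):
    """All binary words of length n in lexicographic order."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def comm_matrix(f, U, V):
    """M_f as a list of rows: entry [i][j] equals f(U[i], V[j])."""
    return [[f(a, b) for b in V] for a in U]

def submatrix(M, U, V, rows, cols):
    """M_f(rows, cols): keep only the given input parts, in the order of U and V."""
    row_idx = [i for i, a in enumerate(U) if a in rows]
    col_idx = [j for j, b in enumerate(V) if b in cols]
    return [[M[i][j] for j in col_idx] for i in row_idx]

def eq(a, b):
    """Illustrative function: 1 iff the two words are equal."""
    return int(a == b)

U = V = words(2)
M = comm_matrix(eq, U, V)   # the 4 x 4 identity matrix
```

Fixing the lexicographic order in `words` is what makes the matrix unambiguous.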

Observe that one needs an ordering on the elements of $U$ and $V$ in order to unambiguously determine $M_f$ for a function $f : U \times V \to \{0,1\}$. Since we usually work with Boolean functions one may always consider the lexicographic order on the Boolean vectors (binary words). So, for the Boolean function $f_{pr} : \{0,1\}^3 \times \{0,1\}^2 \to \{0,1\}$ defined by

$$f_{pr}(u, v) = 1 \text{ iff } v \text{ is a prefix of } u$$

(i.e., $v_1 = u_1$, $v_2 = u_2$ if $u = u_1u_2u_3$, $v = v_1v_2$), $M_{f_{pr}}$ is the following matrix

$$
M_{f_{pr}} =
\begin{array}{c|cccc}
    & 00 & 01 & 10 & 11 \\ \hline
000 & 1 & 0 & 0 & 0 \\
001 & 1 & 0 & 0 & 0 \\
010 & 0 & 1 & 0 & 0 \\
011 & 0 & 1 & 0 & 0 \\
100 & 0 & 0 & 1 & 0 \\
101 & 0 & 0 & 1 & 0 \\
110 & 0 & 0 & 0 & 1 \\
111 & 0 & 0 & 0 & 1
\end{array}
$$

All lower bound techniques we consider are based on some algebraic or combinatorial properties of communication matrices. Let $P$ be a protocol computing a function $f : U \times V \to \{0,1\}$. As already mentioned above, any input of $C_I$ (an element of $U$) corresponds to a row of $M_f$ and any input of $C_{II}$ (an element of $V$) corresponds to a column of $M_f$. So, the work of $P$ is a game where $C_I$ knows $M_f$ and the row of $M_f$ corresponding to its input, and $C_{II}$ knows $M_f$ and the column corresponding to its input. The goal is to determine the value at the intersection of this row and this column. Note that $C_I$ and $C_{II}$ do not necessarily need to determine the exact position of the intersection of their row and column, because if the actual row (seen by $C_I$) contains only 0's, then $C_I$ already knows that the output value is 0. The communication between $C_I$ and $C_{II}$ may be viewed as taking smaller and smaller submatrices of $M_f$ until one obtains a monochromatic submatrix of $M_f$. To support this intuition, suppose that $C_I$ sends the first message $c_1 \in \{0,1\}^+$ to $C_{II}$. Because $C_{II}$ knows the communication protocol, it knows the set $S(c_1)$ of rows for which $C_I$ sends the message $c_1$. So, $C_{II}$ knows that the output value lies in the intersection of its column and the rows of $S(c_1)$. So, the first communication of this game reduces $M_f$ to a submatrix $M_f(c_1)$ of $M_f$ consisting of the rows of $S(c_1)$. Next, $C_{II}$ sends a message $c_2$ to $C_I$. Since $C_I$ knows the protocol and the messages $c_1$ and $c_2$, it can determine for which columns $C_{II}$ sends $c_2$ after receiving $c_1$. So, the matrix $M_f(c_1\$c_2)$ remaining in the game becomes smaller again. Following this game one may suggest that every computation of $P$ determines a set of inputs from $U \times V$ that corresponds to a monochromatic submatrix of $M_f$. So, all computations of any protocol $P$ for $f$ partition $M_f$ into monochromatic submatrices. We confirm this intuition by using the following fundamental observation.

Observation 1. Let $f : U \times V \to \{0,1\}$ be a function, and let $P$ be a protocol. For all $\alpha_1, \alpha_2 \in U$, $\beta_1, \beta_2 \in V$, if $P$ has the same computation $C$ on the inputs $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$, then $P$ has this computation $C$ on the inputs $(\alpha_1, \beta_2)$ and $(\alpha_2, \beta_1)$, too.

In particular, this means that if $P$ accepts [rejects] $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$ by the same computation, then $P$ accepts [rejects] $(\alpha_1, \beta_2)$ and $(\alpha_2, \beta_1)$, too.

Observation 1 provides two powerful techniques for proving lower bounds on communication complexity. Figure 1 gives a transparent explanation of the following claim.

Fig. 1.

Corollary 1. Let $f : U \times V \to \{0,1\}$ be a function that is computed by a protocol $P$. Let, for every computation $C$ of $P$, $\mathrm{Set}(C)$ be the set of all inputs from $U \times V$ that are processed by $C$. Then, $\mathrm{Set}(C)$ corresponds to a monochromatic submatrix $M_f(C)$ of $M_f$, and $M_f(C)$ is a 1-monochromatic [0-monochromatic] submatrix of $M_f$ iff $C$ is an accepting [rejecting] computation.

Fig. 2.



Figure 2 depicts a 1-monochromatic submatrix that is the intersection of the rows 001, 010, 100, 111 and the columns 000, 011, 101, and 110 of the communication matrix $M_f$ for the function $f : \{0,1\}^3 \times \{0,1\}^3 \to \{0,1\}$ defined by

$$f(x_1, x_2, x_3, y_1, y_2, y_3) = x_1 \oplus x_2 \oplus x_3 \oplus y_1 \oplus y_2 \oplus y_3.$$

The computation corresponding to this 1-monochromatic submatrix is involved in the protocol $P$ with $P(x_1x_2x_3) = x_1 \oplus x_2 \oplus x_3$ and $P(y_1y_2y_3, a\$) =$ accept [reject] iff $y_1 \oplus y_2 \oplus y_3 \oplus a = 1$ [0].

Corollary 1 immediately implies that every protocol $P$ computing a function $f$ with $d$ different computations unambiguously determines a partition of $M_f$ into $d$ pairwise disjoint monochromatic submatrices.^5 So, the number of computations of any protocol computing a function $f$ is at least the cardinality of a minimal cover of $M_f$ by pairwise disjoint monochromatic submatrices. Let us formalize this idea in what follows.

Definition 6. Let $M$ be a Boolean matrix and let $S = \{M_1, \dots, M_k\}$ be a set of monochromatic submatrices of $M$. We say that $S$ is a cover of $M$ if for every element $a_{ij}$ of $M$, there exists an $m \in \{1, \dots, k\}$ such that $a_{ij}$ is an element of $M_m$. We say that $S$ is an exact cover of $M$ if $S$ is a cover of $M$ and $M_r \cap M_s = \emptyset$ for every $r \neq s$, $r, s \in \{1, \dots, k\}$.

The tiling complexity of $M$ is

$$\mathrm{Tiling}(M) = \min\{|S| \mid S \text{ is an exact cover of } M\}.$$



Theorem 2. For every finite function $f : U \times V \to \{0,1\}$,

$$\mathrm{cc}(f) \geq \lceil \log_2(\mathrm{Tiling}(M_f)) \rceil - 1.$$

Proof. Above we have proved that every protocol $P$ for $f$ determines an exact cover $S$ of $M_f$, where $|S|$ is the number of different computations of $P$. So, $\mathrm{Tiling}(M_f)$ is a lower bound on the number of computations of any protocol computing $f$. Because of the prefix-freeness property of the protocols, every protocol for $f$ must have a computation $C$ of length at least $\lceil \log_2 \mathrm{Tiling}(M_f) \rceil$. So, $|\mathrm{Com}(C)| \geq \lceil \log_2 \mathrm{Tiling}(M_f) \rceil - 1$ and $\mathrm{cc}(f) \geq \lceil \log_2 \mathrm{Tiling}(M_f) \rceil - 1$. □

One can observe that the matrix $M_f$ depicted in Figure 2 can be exactly covered by two 1-monochromatic submatrices

$$M_f(\{001, 010, 100, 111\}, \{000, 011, 101, 110\}),$$

$$M_f(\{000, 011, 101, 110\}, \{001, 010, 100, 111\}),$$

and by two 0-monochromatic submatrices

$$M_f(\{000, 011, 101, 110\}, \{000, 011, 101, 110\})$$

and

$$M_f(\{001, 010, 100, 111\}, \{001, 010, 100, 111\}).$$

So, the communication protocol sending the bit $x_1 \oplus x_2 \oplus x_3$ for an input $x_1x_2x_3$ is an optimal one and $\mathrm{cc}(f) = 1$ for the function $f$ whose matrix $M_f$ is depicted in Figure 2.
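The exact cover just described can be checked mechanically. The sketch below is illustrative only: the names `is_exact_cover`, `ODD`, `EVEN` and `TILES` are hypothetical, and a submatrix is represented as a pair (list of rows, list of columns).

```python
from itertools import product

W = ["".join(bits) for bits in product("01", repeat=3)]
ODD = [w for w in W if w.count("1") % 2 == 1]    # 001, 010, 100, 111
EVEN = [w for w in W if w.count("1") % 2 == 0]   # 000, 011, 101, 110

def f(x, y):
    """Parity of the six input bits, as in the function of Figure 2."""
    return (x.count("1") + y.count("1")) % 2

def is_exact_cover(f, U, V, rectangles):
    """True iff the rectangles are monochromatic, pairwise disjoint, and cover U x V."""
    seen = set()
    for rows, cols in rectangles:
        if len({f(a, b) for a in rows for b in cols}) != 1:
            return False                 # not monochromatic (or empty)
        for a in rows:
            for b in cols:
                if (a, b) in seen:
                    return False         # overlaps an earlier rectangle
                seen.add((a, b))
    return len(seen) == len(U) * len(V)

TILES = [(ODD, EVEN), (EVEN, ODD), (EVEN, EVEN), (ODD, ODD)]
```

With these four tiles, Theorem 2 yields the lower bound $\lceil \log_2 4 \rceil - 1 = 1$, which the one-bit protocol attains.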

^5 Note that a protocol computing $f$ must have exactly one computation for every entry of the matrix $M_f$.

A simple transparent example of the application of the tiling method is proving the optimal lower bound on $\mathrm{cc}(\mathrm{Eq}_n)$. Since $M_{\mathrm{Eq}_n}$ is the diagonal $2^n \times 2^n$ matrix, one needs $2^n$ 1-monochromatic submatrices of $M_{\mathrm{Eq}_n}$ to cover the diagonal elements (1's) and $2^n$ 0-monochromatic submatrices to cover all 0's of $M_{\mathrm{Eq}_n}$. So, $\mathrm{Tiling}(M_{\mathrm{Eq}_n}) \geq 2^{n+1}$, and $\mathrm{cc}(\mathrm{Eq}_n) \geq n$.
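For completeness, the last step can be written out as a short worked computation. It only combines the tiling bound stated in Theorem 2 with the estimate on $\mathrm{Tiling}(M_{\mathrm{Eq}_n})$ and uses that $\lceil \log_2 x \rceil$ is nondecreasing in $x$.

```latex
\[
\mathrm{cc}(\mathrm{Eq}_n)
  \;\geq\; \lceil \log_2 \mathrm{Tiling}(M_{\mathrm{Eq}_n}) \rceil - 1
  \;\geq\; \lceil \log_2 2^{\,n+1} \rceil - 1
  \;=\; (n+1) - 1
  \;=\; n .
\]
```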

Note that, in general, finding an optimal exact cover for a given Boolean matrix is a nontrivial combinatorial task, and so in some cases the search for $\mathrm{Tiling}(M_f)$ is no simplification of the search for the communication complexity of an optimal protocol for $f$.

The next method may be very practical in some cases because it is a constructive method. To get a lower bound on $\mathrm{cc}(f)$ it is sufficient to find a set of input instances with some special property. This method is also based on Observation 1. We search for a set of inputs such that no two inputs from this set can be covered by the same monochromatic submatrix. So, the cardinality of every exact cover must be at least the cardinality of this set. In this respect the method can be viewed as a constructive method for proving lower bounds on $\mathrm{Tiling}(M_f)$, too.

Definition 7. Let $f : U \times V \to \{0,1\}$ be a function. For every $\delta \in \{0,1\}$, we say that a set $A \subseteq U \times V$ is a $\delta$-fooling set for $f$ if, for all distinct $(\alpha_1, \beta_1), (\alpha_2, \beta_2) \in A$,

(i) $f(\alpha_1, \beta_1) = f(\alpha_2, \beta_2) = \delta$, and
(ii) $f(\alpha_1, \beta_2) \neq \delta$ or $f(\alpha_2, \beta_1) \neq \delta$.

We define

$$\mathrm{Fool}(f) = \max\{|A| \mid A \text{ is a } \delta\text{-fooling set for } f \text{ for some } \delta \in \{0,1\}\}.$$

Looking at Figure 2 we see that $A = \{(001, 011), (011, 100)\}$ is a 1-fooling set because $f(001, 011) = f(011, 100) = 1$ and $f(001, 100) = f(011, 011) = 0$. So, there does not exist any 1-monochromatic submatrix of $M_f$ that would cover both elements of $A$.
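Checking the two conditions of a fooling set is a simple pairwise test. The following sketch is an illustration under stated assumptions: `is_fooling_set` is a hypothetical helper, inputs are bit strings, and `parity` is the function of Figure 2.

```python
def parity(x, y):
    """f(x, y) = XOR of all bits of x and y, the function of Figure 2."""
    return (x.count("1") + y.count("1")) % 2

def is_fooling_set(f, A, delta):
    """True iff A (a collection of distinct input pairs) is a delta-fooling set for f."""
    pairs = list(A)
    if any(f(a, b) != delta for a, b in pairs):
        return False                      # condition (i) fails
    for i, (a1, b1) in enumerate(pairs):
        for a2, b2 in pairs[i + 1:]:
            if f(a1, b2) == delta and f(a2, b1) == delta:
                return False              # condition (ii) fails for this pair
    return True

example = [("001", "011"), ("011", "100")]   # the set A from the text
```

The check is quadratic in $|A|$, which is harmless for hand-built sets such as the diagonal of an equality matrix.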

Theorem 3. For every finite function $f : U \times V \to \{0,1\}$,

$$\mathrm{cc}(f) \geq \lceil \log_2(\mathrm{Fool}(f)) \rceil.$$

Proof. Let $A$ be a $\delta$-fooling set for $f$. Following Figure 1 or our consideration above, no pair of inputs $(\alpha_1, \beta_1), (\alpha_2, \beta_2)$ of $A$ can be covered by one $\delta$-monochromatic submatrix. So, just to cover the elements of $A$ in $M_f$, one needs at least $|A|$ $\delta$-monochromatic submatrices. This is the same as saying that, for $\delta = 1$ [0], every protocol for $f$ must have at least $|A|$ different accepting [rejecting] computations. But $|A|$ different accepting [rejecting] computations mean $|A|$ different communications. As already observed, every protocol with $|A|$ different communications must have a communication of length at least $\lceil \log_2 |A| \rceil$. Thus

$$\mathrm{cc}(f) \geq \lceil \log_2 |A| \rceil$$

for every $\delta$-fooling set $A$ for $f$. □

The third method for proving lower bounds on communication complexity is based on the rank of communication matrices. Let, for any field $F$ with identity elements 0 and 1 and any Boolean matrix $M$, $\mathrm{rank}_F(M)$ denote the rank of the matrix $M$ over the field $F$. $\mathrm{rank}(M) = \mathrm{rank}_{GF(2)}(M)$ denotes the rank of $M$ over the Galois field $GF(2)$, which contains only the elements 0 and 1. We define, for every Boolean matrix $M$,

$$\mathrm{Rank}(M) = \max\{\mathrm{rank}_F(M) \mid F \text{ is a field with identity elements 0 and 1}\}.$$



Observation 2. For every Boolean matrix $M$, $\mathrm{Rank}(M) = \mathrm{rank}_{\mathbb{Q}}(M)$, where $\mathbb{Q}$ is the set of all rational numbers.



Theorem 4. For every finite function $f : U \times V \to \{0,1\}$,

$$\mathrm{cc}(f) \geq \lceil \log_2(\mathrm{Rank}(M_f)) \rceil.$$

Proof. Let $P$ be a protocol computing $f$, and let $F$ be any field with identity elements 0 and 1. It is sufficient to show that the number of accepting computations of $P$ is at least $\mathrm{rank}_F(M_f)$.

Let $C_1, C_2, \dots, C_k$ be all accepting computations of $P$. We already know that, for every $i \in \{1, 2, \dots, k\}$, $C_i$ determines the 1-monochromatic submatrix $M_f(C_i)$ of $M_f$. Let $M_i = [a^i_{rs}]$ be the $|U| \times |V|$ Boolean matrix defined by

$$a^i_{rs} = 1 \text{ iff } a_{rs} \text{ is an element of } M_f(C_i)$$

for $i = 1, \dots, k$. Obviously

$$M_f = \sum_{i=1}^{k} M_i.$$

By the properties of the rank over any field, and because every $M_i$ has rank 1, we have

$$\mathrm{rank}_F(M_f) \leq \sum_{i=1}^{k} \mathrm{rank}_F(M_i) = k.$$

Thus, $k \geq \mathrm{rank}_F(M_f)$, and to get $k$ accepting computations $P$ must contain a communication of at least $\log_2 k$ bits. □
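The rank bound is easy to evaluate on small matrices. The sketch below is an illustration, not the author's procedure: `matrix_rank` is a hypothetical helper that runs Gaussian elimination either over $GF(2)$ or exactly over the rationals, and the two example matrices are assumptions chosen to show that the two ranks can differ.

```python
from fractions import Fraction

def matrix_rank(M, over_gf2=False):
    """Rank of a 0/1 matrix by Gaussian elimination.

    over_gf2=True computes the rank over GF(2); otherwise exact arithmetic
    over the rationals is used.
    """
    rows = [[x % 2 if over_gf2 else Fraction(x) for x in row] for row in M]
    rk = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][c] != 0:
                if over_gf2:
                    rows[r] = [x ^ y for x, y in zip(rows[r], rows[rk])]
                else:
                    factor = rows[r][c] / rows[rk][c]
                    rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

identity8 = [[int(i == j) for j in range(8)] for i in range(8)]   # matrix of Eq_3
cyclic = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]   # rows sum to zero modulo 2 only
```

For `identity8` both ranks are 8, giving the bound $\lceil \log_2 8 \rceil = 3$; for `cyclic` the rank over the rationals is larger than the rank over $GF(2)$.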

Now, we look for lower bounds on one-way communication complexity. Here, 
the situation is quite simple. 

Definition 8. Let $f : U \times V \to \{0,1\}$ be a finite function. A set $A \subseteq U$ is called a one-way fooling set for $f$ if, for every $\alpha, \beta \in A$, $\alpha \neq \beta$, there exists a $\gamma \in V$ such that

$$f(\alpha, \gamma) \neq f(\beta, \gamma).$$

We define

$$\text{One-Way-Fool}(f) = \max\{|A| \mid A \text{ is a one-way fooling set for } f\}.$$

For every Boolean matrix $M$, we define $\mathrm{Row}(M)$ to be the number of different rows of $M$.

Theorem 5. For every finite function $f : U \times V \to \{0,1\}$,

$$2^{\mathrm{cc}_1(f)} \geq \text{One-Way-Fool}(f) = \mathrm{Row}(M_f).$$

Proof. One can easily observe that two rows $\alpha$ and $\beta$ of $M_f$ are different if and only if there is a column $\gamma$ such that $f(\alpha, \gamma) \neq f(\beta, \gamma)$. So, $\text{One-Way-Fool}(f) = \mathrm{Row}(M_f)$ because every one-way fooling set $A$ for $f$ determines $|A|$ different rows of $M_f$ and, vice versa, any $k$ different rows of $M_f$ determine a one-way fooling set of cardinality $k$.

Let $P$ be a one-way protocol for $f$. Let $P(\alpha) = P(\beta)$ for some $\alpha, \beta \in U$, $\alpha \neq \beta$ (i.e. $C_I$ sends the same message to $C_{II}$ for both $\alpha$ and $\beta$). Then, for every $\gamma \in V$, $P$ must compute the same output for both inputs $(\alpha, \gamma)$ and $(\beta, \gamma)$ because $P(\gamma, P(\alpha)\$) = P(\gamma, P(\beta)\$)$. So, $P(\alpha) = P(\beta)$ can be true only if the rows of $\alpha$ and $\beta$ are equal. Thus, the number of different messages (communications) of $P$ must be at least $\mathrm{Row}(M_f)$. □
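The counting argument also suggests a matching one-way protocol: $C_I$ sends a number identifying the row of its input, and $C_{II}$ reads off the entry in its own column. The sketch below is illustrative; `build_one_way_protocol` and the example function `first_bits_and` are hypothetical names, not taken from the text.

```python
def build_one_way_protocol(f, U, V):
    """Return (message, output, rows) for a one-way protocol using Row(M_f) messages.

    message[a] is the number C_I sends on input a; output(b, msg) is the value
    C_II computes from its input b and the received message.
    """
    message_of_row = {}          # distinct row of M_f -> message number
    message = {}                 # input part of C_I -> message number
    for a in U:
        row = tuple(f(a, b) for b in V)
        message[a] = message_of_row.setdefault(row, len(message_of_row))
    row_of_message = {msg: row for row, msg in message_of_row.items()}
    column_index = {b: j for j, b in enumerate(V)}

    def output(b, msg):
        return row_of_message[msg][column_index[b]]

    return message, output, len(message_of_row)

def first_bits_and(a, b):
    """Illustrative function: 1 iff both words start with the bit 1."""
    return int(a[0] == "1" and b[0] == "1")

WORDS = ["00", "01", "10", "11"]
message, output, rows = build_one_way_protocol(first_bits_and, WORDS, WORDS)
```

Here only two different rows occur, so two messages, that is one bit, suffice.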

We can immediately observe that there may exist an exponential gap between $\mathrm{cc}_1(f)$ and $\mathrm{cc}(f)$ because there are Boolean matrices with an exponential gap between their $\mathrm{Row}(M)$ and $\mathrm{Rank}(M)$. For instance, for any positive integer $n$, one can construct a $2^n \times n$ Boolean matrix $M$ consisting of $2^n$ different rows, i.e. containing all $n$-dimensional Boolean vectors as the rows. Obviously, $\mathrm{Row}(M) = 2^n$ and $\mathrm{Rank}(M) = n$.

To see the difference for a function from $\{0,1\}^n \times \{0,1\}^n$ to $\{0,1\}$ consider the function $f_{\mathrm{ind}(n)}$ introduced in Section 2.

Lemma 1. For every positive integer $n = 2^k$, $k \in \mathbb{N} - \{0\}$,

$$\mathrm{cc}_1(f_{\mathrm{ind}(n)}) = n \quad \text{and} \quad \mathrm{cc}(f_{\mathrm{ind}(n)}) \leq 2 \cdot \log_2 n.$$

Proof. A protocol computing $f_{\mathrm{ind}(n)}$ within the communication complexity $2 \cdot \log_2 n$ was presented in Section 2. The fact $\mathrm{cc}_1(f_{\mathrm{ind}(n)}) \leq n$ is obvious.

It remains to show $\mathrm{cc}_1(f_{\mathrm{ind}(n)}) \geq n$. To do this observe that the $2^n \times 2^n$ matrix $M_{f_{\mathrm{ind}(n)}}$ has $2^n$ different rows. Let $\alpha = \alpha_1 \dots \alpha_n \in \{0,1\}^n$, $\beta = \beta_1 \dots \beta_n \in \{0,1\}^n$, $\alpha \neq \beta$ be two input parts determining two rows of $M_{f_{\mathrm{ind}(n)}}$. Since $\alpha \neq \beta$, there exists an integer $j \in \{1, \dots, n\}$ such that $\alpha_j \neq \beta_j$. Without loss of generality assume $\alpha_j = 1$ and $\beta_j = 0$. Let $j$ be the smallest integer with the above property. We distinguish two possibilities according to whether $j > \log_2 n$ or $j \leq \log_2 n$.

If $j > \log_2 n$, then $\alpha_1 \dots \alpha_{\log_2 n} = \beta_1 \dots \beta_{\log_2 n}$ and so

$$a = \mathrm{Number}(\alpha_1 \dots \alpha_{\log_2 n}) + 1 = \mathrm{Number}(\beta_1 \dots \beta_{\log_2 n}) + 1.$$

We choose a $\gamma = \gamma_1 \dots \gamma_n$ such that

$$\mathrm{Number}(\gamma_a \gamma_{(a+1) \bmod n} \dots \gamma_{(a + \lceil \log_2 n \rceil) \bmod n}) + 1 = j.$$

Obviously $f(\alpha, \gamma) = 1$ and $f(\beta, \gamma) = 0$ and so the rows corresponding to $\alpha$ and $\beta$ are different.

The proof for the case $j \leq \log_2 n$ is left to the reader. □

Finally, we deal with proving lower bounds on nondeterministic communication complexity. First of all we observe that Observation 1 holds for nondeterministic protocols, too. This means that if $C$ is an accepting computation on inputs $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$, then $C$ is a possible accepting computation on the inputs $(\alpha_1, \beta_2)$ and $(\alpha_2, \beta_1)$. Thus we have again the situation that every accepting computation corresponds to a 1-monochromatic submatrix. The situation differs from the deterministic case in that a nondeterministic protocol possibly has several computations on one input and so the monochromatic matrices determined by computations may overlap. Thus, any nondeterministic protocol determines a cover of the 1's of the communication matrix by possibly non-disjoint 1-monochromatic submatrices. Observe that a nondeterministic protocol does not necessarily determine any cover of the 0's of $M_f$.

Definition 9. Let $M$ be a Boolean matrix, and let $S = \{M_1, \dots, M_k\}$ be a set of 1-monochromatic submatrices of $M$. We say that $S$ is a 1-cover of $M$ if every 1 of $M$ is contained in at least one of the 1-submatrices of $S$. We define

$$\mathrm{Cover}(M) = \min\{|S| \mid S \text{ is a 1-cover of } M\}.$$



Theorem 6. For every finite function $f : U \times V \to \{0,1\}$,

$$\mathrm{ncc}(f) = \lceil \log_2(\mathrm{Cover}(M_f)) \rceil.$$

Proof. We have already argued that $\mathrm{ncc}(f) \geq \lceil \log_2(\mathrm{Cover}(M_f)) \rceil$ because every nondeterministic protocol with $k$ accepting computations determines a 1-cover of $M_f$ of size $k$.

Thus, it remains to prove $\mathrm{ncc}(f) \leq \lceil \log_2(\mathrm{Cover}(M_f)) \rceil$. Let $S = \{M_1, M_2, \dots, M_k\}$ be a 1-cover of $M_f$, and let $k = \mathrm{Cover}(M_f)$. We construct a nondeterministic one-way protocol $D = (C_I, C_{II})$ for $f$ as follows. For every input $\alpha \in U$, $C_I$ looks for all submatrices of $S$ with non-empty intersections with $\mathrm{row}_\alpha$. $C_I$ chooses one of these 1-submatrices $M_i$ and sends the message $\mathrm{BIN}_{\lceil \log_2 k \rceil}(i)$. If $C_{II}$ receives $\mathrm{BIN}_{\lceil \log_2 k \rceil}(i)$ and $\mathrm{column}_\beta$ of the input $\beta \in V$ of $C_{II}$ intersects $M_i$, then $C_{II}$ accepts. Otherwise, $C_{II}$ rejects. □

Observe that the proof of Theorem 6 also provides an alternative proof of the fact $\mathrm{ncc}(f) = \mathrm{ncc}_1(f)$ for every $f$. Since the cardinality of any 1-fooling set for $f$ is a lower bound^6 on $\mathrm{Cover}(M_f)$, we obtain the following result.

^6 Remember that no pair of elements of a 1-fooling set can be covered by one 1-monochromatic submatrix.

Theorem 7. Let $f : U \times V \to \{0,1\}$ be a finite function. For every 1-fooling set $A$ for $f$,

$$\mathrm{ncc}(f) \geq \lceil \log_2 |A| \rceil.$$

We see that the function $\mathrm{Eq}_n$ is hard for nondeterministic protocols because one needs $2^n$ 1-monochromatic submatrices to cover the $2^n$ diagonal elements of the $2^n \times 2^n$ diagonal matrix $M_{\mathrm{Eq}_n}$.

For $\mathrm{Ineq}_n$ the situation essentially differs. The ones of a $2^n \times 2^n$ 0-diagonal matrix can be covered by $2n$ 1-monochromatic submatrices

$$M_1, M_1', M_2, M_2', \dots, M_n, M_n',$$

where $M_i$ ($M_i'$) is the intersection of all rows whose $i$-th bit is 1 (0) with all columns whose $i$-th bit is 0 (1). On the other hand, to cover the 0's of $M_{\mathrm{Ineq}_n}$ one needs $2^n$ 0-monochromatic matrices. So, this implies the following exponential gap between deterministic communication complexity and nondeterministic communication complexity.
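The $2n$ submatrices can be generated and checked for small $n$. The following sketch is an illustration only: `ineq_one_cover`, `is_one_cover` and `nondeterministic_accepts` are hypothetical helper names, and a submatrix is represented as a pair (rows, columns). The last function mirrors the nondeterministic one-way protocol built from a 1-cover: an input is accepted iff some guessed submatrix contains both its row and its column.

```python
from itertools import product

def ineq(a, b):
    """Ineq_n: 1 iff the two words differ."""
    return int(a != b)

def ineq_one_cover(n):
    """Return (words, rectangles) with the 2n submatrices M_i and M_i'."""
    words = ["".join(bits) for bits in product("01", repeat=n)]
    rectangles = []
    for i in range(n):
        ones = [w for w in words if w[i] == "1"]
        zeros = [w for w in words if w[i] == "0"]
        rectangles.append((ones, zeros))   # M_i : rows with bit 1, columns with bit 0
        rectangles.append((zeros, ones))   # M_i': rows with bit 0, columns with bit 1
    return words, rectangles

def is_one_cover(f, U, V, rectangles):
    """True iff every rectangle is 1-monochromatic and together they hit all 1's."""
    covered = set()
    for rows, cols in rectangles:
        for a in rows:
            for b in cols:
                if f(a, b) != 1:
                    return False
                covered.add((a, b))
    return covered == {(a, b) for a in U for b in V if f(a, b) == 1}

def nondeterministic_accepts(a, b, rectangles):
    """Accept iff some choice of rectangle contains both the row a and the column b."""
    return any(a in rows and b in cols for rows, cols in rectangles)
```

For $n = 3$ the cover has 6 submatrices, so a message of $\lceil \log_2 6 \rceil = 3$ bits names the guessed submatrix.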

Theorem 8. For every positive integer $n$,

(i) $\mathrm{ncc}(\mathrm{Ineq}_n) \leq \lceil 2 \log_2 n \rceil$, and
(ii) $\mathrm{cc}(\mathrm{Ineq}_n) = n$.

4 Randomized Protocols 

First, we relate Las Vegas and determinism for communication complexity. The 
results show that Las Vegas can be more efficient than determinism, but the 
difference between the power of these two models of computation is not very 
large. One conjectures that this kind of relation should hold for several fundamental computing models and their complexity measures, including the time complexity of Turing machines. Unfortunately, attempts to establish such relations have failed for most complexity measures investigated. In this section we show that there is an at most quadratic gap between $\mathrm{cc}(f)$ and $\mathrm{lvcc}(f)$ and a linear relation between one-way communication complexity and Las Vegas one-way communication complexity. Examples of concrete functions show that Las Vegas can indeed be more efficient than determinism within the limits of the relations established above.

First of all we show that there is a polynomial relation between Las Vegas communication complexity and deterministic communication complexity. To get this result one proves a more powerful result claiming that if both a function $f$ and its complement $\bar{f}$ are easy for a nondeterministic protocol, then $f$ is easy for a deterministic protocol, too. In what follows, for every function $f : U \times V \to \{0,1\}$, the complement of $f$ is the function $\bar{f}$ defined by

$$\bar{f}(\alpha, \beta) = \neg(f(\alpha, \beta))$$

for all $(\alpha, \beta) \in U \times V$. $\neg$ denotes the unary Boolean operation called negation.

Theorem 9 ([20]). For every finite function $f : U \times V \to \{0,1\}$,

$$\mathrm{cc}(f) \leq \mathrm{ncc}(f) \cdot (\mathrm{ncc}(\bar{f}) + 2).$$

Proof. Let $r_1 = \mathrm{ncc}(f)$ and $r_0 = \mathrm{ncc}(\bar{f})$. This directly implies that the 1's of $M_f$ can be covered by at most $2^{r_1}$ 1-monochromatic submatrices, and that the 0's of $M_f$ can be covered by at most $2^{r_0}$ 0-monochromatic submatrices. We shall prove that $\mathrm{cc}(f) \leq r_1 \cdot (r_0 + 2)$.

The proof relies on the following property of monochromatic submatrices. Let $M(R, S)$ be a 0-monochromatic submatrix, and let $M(R', S')$ be a 1-monochromatic submatrix of $M_f$. Then either $R \cap R' = \emptyset$ or $S \cap S' = \emptyset$. The proof of this claim is straightforward because if $R \cap R' \neq \emptyset$ and $S \cap S' \neq \emptyset$, then there exists a pair $(\alpha, \beta) \in (R \cap R') \times (S \cap S')$, i.e. a pair that belongs to both $M(R, S)$ and $M(R', S')$. However, this is impossible because $(\alpha, \beta)$ in $M(R, S)$ implies $f(\alpha, \beta) = 0$, whereas $(\alpha, \beta)$ in $M(R', S')$ implies $f(\alpha, \beta) = 1$.

Before starting the proper proof we fix some useful notation. Let $\mathcal{C} = \{C_1, \dots, C_m\}$, $m \leq 2^{r_0}$, be a set of 0-monochromatic submatrices of $M_f$ such that $\mathcal{C}$ covers all 0's of $M_f$. Let $\mathcal{H} = \{H_1, \dots, H_l\}$, $l \leq 2^{r_1}$, be a set of 1-monochromatic submatrices of $M_f$ covering all 1's in $M_f$. Let $A_i$ denote the submatrix of $M_f$ formed by those rows of $M_f$ that are contained in $C_i$ (i.e., if $C_i = M(R, S)$, then $A_i = M(R, V)$). Let $B_i$ denote the submatrix of $M_f$ formed by the columns of $M_f$ that meet $C_i$ (i.e., if $C_i = M(R, S)$, then $B_i = M(U, S)$). Let $\mathrm{int}(A_i)$ and $\mathrm{int}(B_i)$ denote the number of 1-monochromatic submatrices from $\mathcal{H}$ that have a non-empty intersection with $A_i$ and $B_i$, respectively. Since the intersection of $A_i$ and $B_i$ is exactly the 0-monochromatic submatrix $C_i$, the claim stated above implies that no matrix $H_j \in \mathcal{H}$ has non-empty intersections with both $A_i$ and $B_i$. So,

$$\mathrm{int}(A_i) + \mathrm{int}(B_i) \leq l = |\mathcal{H}|$$

for every $i \in \{1, 2, \dots, m\}$. Set

$$\mathcal{C}_1 = \{C_k \in \mathcal{C} \mid \mathrm{int}(A_k) \leq \lceil l/2 \rceil\}$$

and

$$\mathcal{C}_2 = \mathcal{C} - \mathcal{C}_1,$$

so that $\mathrm{int}(B_k) \leq l/2$ for every $C_k \in \mathcal{C}_2$.

Now, we describe the first two rounds of a deterministic protocol $D = (D_I, D_{II})$ for $f$. For every input $(\alpha, \beta) \in U \times V$, $D$ works as follows.

Round 1. $D_I$ looks at the row of $M_f$ that corresponds to $\alpha$ in order to see whether it intersects any of the 0-monochromatic submatrices in $\mathcal{C}_1$. If so, it sends the message "$1\mathrm{BIN}_{r_0}(j)$", where $j$ is the smallest index such that $C_j \in \mathcal{C}_1$ and $C_j$ intersects the row of $\alpha$. If $\mathrm{row}_\alpha$ does not intersect any submatrix of $\mathcal{C}_1$, then $D_I$ sends the message "0".

Round 2. If $D_{II}$ receives "0", it checks whether the column corresponding to its input $\beta$ intersects any of the 0-monochromatic submatrices in $\mathcal{C}_2$. If so, it sends the message "$1\mathrm{BIN}_{r_0}(k)$", where $k$ is the smallest index such that $C_k \in \mathcal{C}_2$ and $C_k$ intersects the column of $\beta$.

Otherwise, $D_{II}$ sends the message "0".

If $D_{II}$ receives "$1\mathrm{BIN}_{r_0}(j)$", then it sends "1" to $D_I$.

Now, let us distinguish and discuss three possible situations after the first two rounds of $D$.

Case 1. The current communication history is $0\$0\$$, i.e. both computers failed to find an appropriate 0-monochromatic submatrix. Since $\mathcal{C}_1 \cup \mathcal{C}_2 = \mathcal{C}$ and the set $\mathcal{C}$ covers all zeros in $M_f$, we get $f(\alpha, \beta) = 1$. So, both $D_I$ and $D_{II}$ know that the output has to be "accept".

Case 2. The current communication history is $1\mathrm{BIN}_{r_0}(j)\$1\$$. In this case both $D_I$ and $D_{II}$ know that the input $(\alpha, \beta)$ belongs to $A_j$. Since $C_j \in \mathcal{C}_1$, $\mathrm{int}(A_j) \leq \lceil l/2 \rceil \leq 2^{r_1 - 1}$, i.e., all ones in $A_j$ can be covered by at most $2^{r_1 - 1} = 2^{r_1}/2$ 1-monochromatic submatrices of $A_j$. (Note that these 1-monochromatic submatrices are all intersections of $A_j$ with the 1-monochromatic submatrices $H_1, H_2, \dots, H_l$ of $M_f$.)

Case 3. The current communication history is $0\$1\mathrm{BIN}_{r_0}(k)\$$. In this case both $D_I$ and $D_{II}$ know that the input $(\alpha, \beta)$ lies in $B_k$. Since $C_k \in \mathcal{C}_2$, they know that $\mathrm{int}(B_k) \leq l/2 \leq 2^{r_1 - 1}$, i.e., all ones in $B_k$ can be covered by at most $2^{r_1 - 1}$ 1-monochromatic submatrices.

Thus, after these first two rounds either both computers know the output value $f(\alpha, \beta)$ or both $D_I$ and $D_{II}$ know that $(\alpha, \beta)$ lies in a matrix $M_1 \in \{A_j, B_k\}$ whose ones can be covered by at most $2^{r_1 - 1}$ 1-monochromatic submatrices of $M_1$, and whose zeros can be covered by at most $2^{r_0}$ 0-monochromatic submatrices. Following the same communication strategy as described above for $M_f$ in the next two rounds for the matrix $M_1$, $D_I$ and $D_{II}$ either learn the result $f(\alpha, \beta)$ or they agree that $(\alpha, \beta)$ is in a submatrix $M_2$ such that

(i) all ones of $M_2$ can be covered by at most $2^{r_1 - 2}$ 1-monochromatic submatrices of $M_2$, and

(ii) all zeros of $M_2$ can be covered by at most $2^{r_0}$ 0-monochromatic submatrices of $M_2$.

Continuing in this way, $D_I$ and $D_{II}$ learn $f(\alpha, \beta)$ after at most $r_1$ repetitions of this two-round exchange. Since every information exchange in rounds $2i$ and $2i + 1$ has the length $2 + r_0$, $\mathrm{cc}(D) \leq r_1 \cdot (2 + r_0)$. □

Obviously, $\mathrm{lvcc}(f) = \mathrm{lvcc}(\bar{f})$ for every function $f$. Since $\mathrm{ncc}(f)$ is a lower bound on the private Las Vegas communication complexity, there is at most a quadratic gap between determinism and private Las Vegas communication protocols. Since the relation between the private randomized communication complexity and the public one is linear if the randomized communication complexity is at least $\Omega(\log_2 n)$ for Boolean functions of $n$ variables^7, we can express the consequences of Theorem 9 in terms of public Las Vegas communication complexity as follows.

Theorem 10. For every Boolean function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$,

$$\mathrm{cc}(f) = O\left((\mathrm{lvcc}(f))^2 + \log_2 n\right).$$

^7 See Section 6 for the exact formulation of the relation and its proof.

Our next aim is to show that the at most quadratic gap between determinism and Las Vegas presented in Theorem 10 can be achieved for a concrete Boolean function. Let, for every positive integer $n$, $n = m^2$,

$$\mathrm{ExAll}_{2n}(x_{1,1}, x_{1,2}, \dots, x_{1,m}, \dots, x_{m,1}, \dots, x_{m,m}, y_{1,1}, y_{1,2}, \dots, y_{1,m}, \dots, y_{m,1}, \dots, y_{m,m})$$

be a Boolean function from $\{0,1\}^n \times \{0,1\}^n$ to $\{0,1\}$ defined by

$$\mathrm{ExAll}_{2n}(\alpha_{1,1}, \dots, \alpha_{m,m}, \beta_{1,1}, \dots, \beta_{m,m}) = 1$$

iff $\exists j \in \{1, \dots, m\}$ such that $\alpha_{j,1} \dots \alpha_{j,m} = \beta_{j,1} \dots \beta_{j,m}$.
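In code, the definition reads as a comparison of corresponding blocks. The following is a minimal sketch with a hypothetical helper name `ex_all`; inputs are bit strings of length $m^2$ that are cut into $m$ consecutive blocks of length $m$.

```python
def ex_all(alpha, beta, m):
    """ExAll_{2n} for n = m*m: 1 iff some block of alpha equals the same block of beta."""
    assert len(alpha) == len(beta) == m * m
    return int(any(alpha[j * m:(j + 1) * m] == beta[j * m:(j + 1) * m]
                   for j in range(m)))
```

For example, with $m = 2$ the inputs `"0110"` and `"1110"` agree on their second block, so the value is 1.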

Theorem 11 ([21]). For all positive integers $n, m$ with $n = m^2$,

(i) $\mathrm{cc}(\mathrm{ExAll}_{2n}) = n = m^2$, and

(ii) $\mathrm{lvcc}(\mathrm{ExAll}_{2n}) \leq 2m(\lceil \log_2 m \rceil^2 + 1)$ for sufficiently large $m$.

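Proof.

(i) The fact $\mathrm{cc}(\mathrm{ExAll}_{2n}) \leq n$ is obvious. To prove the lower bound it is sufficient to prove $\mathrm{cc}(\mathrm{ExAll}_{2n}) \geq n$. Since a protocol for a function becomes a protocol for its complement by exchanging the two outputs, we may work with the complement $\overline{\mathrm{ExAll}}_{2n}$. Note that

$$\overline{\mathrm{ExAll}}_{2n}(\alpha_{1,1}, \dots, \alpha_{m,m}, \beta_{1,1}, \dots, \beta_{m,m}) = 1$$

iff, for every $i \in \{1, \dots, m\}$, $\alpha_{i,1} \dots \alpha_{i,m} \neq \beta_{i,1} \dots \beta_{i,m}$. To prove the lower bound we use the rank method. More precisely, we show that the rows of the $2^n \times 2^n$ diagonal matrix are linear combinations of the rows of $M_{\overline{\mathrm{ExAll}}_{2n}}$, and so $\mathrm{rank}(M_{\overline{\mathrm{ExAll}}_{2n}}) = 2^n$. Consider the function $q_{2n} : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ defined for all $\delta_1, \dots, \delta_m, \gamma_1, \dots, \gamma_m \in \{0,1\}^m$ as follows:

$$q_{2n}(\delta_1\delta_2 \dots \delta_m, \gamma_1\gamma_2 \dots \gamma_m) = \sum_{\alpha_1 \neq \delta_1} \sum_{\alpha_2 \neq \delta_2} \cdots \sum_{\alpha_m \neq \delta_m} \overline{\mathrm{ExAll}}_{2n}(\alpha_1\alpha_2 \dots \alpha_m, \gamma_1\gamma_2 \dots \gamma_m) \bmod 2,$$

where $\alpha_i \in \{0,1\}^m$ for $i = 1, 2, \dots, m$. Our aim is to show that

$$q_{2n}(\alpha, \beta) = \mathrm{Eq}_n(\alpha, \beta)$$

for every positive integer $n$ and all $(\alpha, \beta) \in \{0,1\}^n \times \{0,1\}^n$. Let us prove this. First, consider inputs $(\delta, \delta) = (\delta_1\delta_2 \dots \delta_m, \delta_1\delta_2 \dots \delta_m)$ for any $\delta_i \in \{0,1\}^m$, $i = 1, \dots, m$. Then

$$q_{2n}(\delta, \delta) = q_{2n}(\delta_1\delta_2 \dots \delta_m, \delta_1\delta_2 \dots \delta_m) = (2^m - 1)^m \bmod 2 = 1$$

because there are exactly $2^m - 1$ words from $\{0,1\}^m$ different from $\delta_i$ and $\overline{\mathrm{ExAll}}_{2n}(w, \delta_1 \dots \delta_m) = 1$ for all

$$w \in \{w_1w_2 \dots w_m \mid w_i \in \{0,1\}^m - \{\delta_i\} \text{ for } i = 1, \dots, m\}.$$

Now, consider inputs $(\delta, \gamma) = (\delta_1\delta_2 \dots \delta_m, \gamma_1\gamma_2 \dots \gamma_m)$, $\delta_i, \gamma_i \in \{0,1\}^m$ for $i = 1, \dots, m$, with $\delta \neq \gamma$. Let $S(\delta, \gamma) \subseteq \{1, \dots, m\}$ be such that, for all $j \in S(\delta, \gamma)$, $\delta_j \neq \gamma_j$, and for every $k \in \{1, \dots, m\} - S(\delta, \gamma)$, $\delta_k = \gamma_k$. Let $|S(\delta, \gamma)| = r$. We know that $r \geq 1$. Then

$$q_{2n}(\delta, \gamma) = (2^m - 2)^r \cdot (2^m - 1)^{m - r} \bmod 2 = 0.$$

We proved that the communication matrix $M_{q_{2n}}$ is the diagonal matrix of size $2^n \times 2^n$, and so $\mathrm{rank}(M_{q_{2n}}) = 2^n$. Following the definition of $q_{2n}$ we see that every row of $M_{q_{2n}}$ is a linear combination of rows of $M_{\overline{\mathrm{ExAll}}_{2n}}$. So,

$$\mathrm{rank}(M_{\overline{\mathrm{ExAll}}_{2n}}) \geq \mathrm{rank}(M_{q_{2n}}) = 2^n.$$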
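The identity between the function $q_{2n}$ of this proof and the equality function can be confirmed by brute force for the smallest interesting case $m = 2$, $n = 4$. The sketch below is an illustration; the helper names `blocks`, `not_ex_all` and `q` are hypothetical, and `not_ex_all` stands for the complement of ExAll, i.e. it is 1 iff all corresponding blocks differ.

```python
from itertools import product

def blocks(word, m):
    """Split a word of length m*m into m blocks of length m."""
    return [word[i * m:(i + 1) * m] for i in range(m)]

def not_ex_all(alpha, beta, m):
    """Complement of ExAll: 1 iff every block of alpha differs from that of beta."""
    return int(all(a != b for a, b in zip(blocks(alpha, m), blocks(beta, m))))

def q(delta, gamma, m):
    """Sum of not_ex_all(alpha, gamma) over all alpha with alpha_i != delta_i, modulo 2."""
    block_words = ["".join(bits) for bits in product("01", repeat=m)]
    choices = [[w for w in block_words if w != d] for d in blocks(delta, m)]
    total = sum(not_ex_all("".join(alpha), gamma, m) for alpha in product(*choices))
    return total % 2

ALL4 = ["".join(bits) for bits in product("01", repeat=4)]
```

On the diagonal the sum has $(2^2 - 1)^2 = 9$ terms equal to 1, which is odd; off the diagonal the count contains a factor $2^2 - 2 = 2$ and is therefore even.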

(ii) First, we describe a randomized protocol $D = (D_I, D_{II})$ for $\mathrm{ExAll}_{2n}$ and then we analyze its communication complexity. Let $\mathrm{Prim}_m = \{p \in \mathbb{N} \mid p \leq m \text{ and } p \text{ is a prime}\}$.

Protocol $D$

Input: $(\alpha, \beta) = (\alpha_1\alpha_2 \dots \alpha_m, \beta_1\beta_2 \dots \beta_m)$, $\alpha_i, \beta_i \in \{0,1\}^m$ for $i = 1, 2, \dots, m$.

Step 1. $d = 2 \cdot \lceil \log_2 m \rceil$ prime numbers $s_1, s_2, \dots, s_d$ are chosen uniformly at random from $\mathrm{Prim}_m$.

{Thus, both $D_I$ and $D_{II}$ know $s_1, s_2, \dots, s_d$.}

Step 2. The first computer $D_I$ computes $\mathrm{Number}(\alpha_i) \bmod s_j$ for all $i \in \{1, \dots, m\}$ and all $j \in \{1, \dots, d\}$ and sends the binary representations of all $m \cdot d$ results to $D_{II}$.

Step 3. If, for every $i \in \{1, \dots, m\}$, there exists $j_i$ such that

$$\mathrm{Number}(\alpha_i) \bmod s_{j_i} \neq \mathrm{Number}(\beta_i) \bmod s_{j_i},$$

then $D_{II}$ outputs "reject".

Else, let $k \in \{1, \dots, m\}$ be the smallest integer such that

$$\mathrm{Number}(\alpha_k) \bmod s_j = \mathrm{Number}(\beta_k) \bmod s_j$$

for all $j \in \{1, 2, \dots, d\}$. Then $D_{II}$ sends the binary representation of $k$ to $D_I$.

Step 4. When $D_I$ receives the binary representation of a positive integer $k$, then $D_I$ sends $\alpha_k$ to $D_{II}$.

Step 5. Receiving $\alpha_k$, $D_{II}$ compares $\alpha_k$ with $\beta_k$.

If $\alpha_k = \beta_k$, then $D_{II}$ outputs "accept".

If $\alpha_k \neq \beta_k$, then $D_{II}$ outputs "?".
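The five steps can be simulated directly. The following sketch is illustrative and makes several assumptions: the names `primes_up_to` and `protocol_d` are hypothetical, inputs are bit strings of length $m^2$ with $m \geq 2$, the shared primes are drawn from one random generator, and the optional parameter `prime_bound` (not in the text) lets one experiment with a larger prime range than the default bound $m$.

```python
import random

def primes_up_to(limit):
    """Return all primes p with 2 <= p <= limit (sieve of Eratosthenes)."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def protocol_d(alpha, beta, m, rng=random, prime_bound=None):
    """One run of protocol D; returns "accept", "reject" or "?"."""
    assert m >= 2 and len(alpha) == len(beta) == m * m
    primes = primes_up_to(m if prime_bound is None else prime_bound)
    d = 2 * (m - 1).bit_length()                  # d = 2 * ceil(log2 m)
    s = [rng.choice(primes) for _ in range(d)]    # Step 1: shared random primes
    a = [alpha[i * m:(i + 1) * m] for i in range(m)]
    b = [beta[i * m:(i + 1) * m] for i in range(m)]
    # Step 2: the m * d fingerprints that D_I sends to D_II.
    sent = [[int(block, 2) % p for p in s] for block in a]
    # Step 3: blocks of beta whose fingerprints all match those of alpha.
    matching = [i for i in range(m)
                if sent[i] == [int(b[i], 2) % p for p in s]]
    if not matching:
        return "reject"
    k = matching[0]                               # smallest such index, sent to D_I
    # Steps 4 and 5: D_I sends the block alpha_k, D_II compares it with beta_k.
    return "accept" if a[k] == b[k] else "?"
```

Whatever primes are drawn, the run never returns a wrong answer: "reject" requires a fingerprint mismatch in every block, and "accept" requires an exact match of a whole block.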

We again use the fact presented in Section 2 that, for all $\gamma, \delta \in \{0,1\}^m$ and sufficiently large $m$, $\gamma \neq \delta$ implies

$$|\{p \in \mathrm{Prim}_m \mid \mathrm{Number}(\gamma) \bmod p \neq \mathrm{Number}(\delta) \bmod p\}| \geq \frac{|\mathrm{Prim}_m|}{2}.$$

Consider an input $(\alpha, \beta)$ such that $\mathrm{ExAll}_{2n}(\alpha, \beta) = 0$. This means $\alpha = \alpha_1\alpha_2 \dots \alpha_m$, $\beta = \beta_1\beta_2 \dots \beta_m$ and $\alpha_i \neq \beta_i$ for $i = 1, \dots, m$. For every $i \in \{1, \dots, m\}$, the probability that

$$\mathrm{Number}(\alpha_i) \bmod s_j = \mathrm{Number}(\beta_i) \bmod s_j$$

for all $j \in \{1, \dots, d\}$ is at most $2^{-d}$. So, the probability that $D_{II}$ recognizes $\alpha_i \neq \beta_i$ in Step 3 is at least $1 - 2^{-d}$. This implies that the probability to recognize $\alpha_i \neq \beta_i$ for all $i \in \{1, \dots, m\}$, and so to reject $(\alpha, \beta)$, is at least $(1 - 2^{-d})^m \geq (1 - 1/m^2)^m \geq 1/2$ for sufficiently large $m$. Obviously, if $D_{II}$ does not reject $(\alpha, \beta)$ in Step 3, then $D_{II}$ must output "?" in Step 5.

Now, consider $\mathrm{ExAll}_{2n}(\alpha, \beta) = 1$, i.e. there exists a $k \in \{1, \dots, m\}$ such that $\alpha_k = \beta_k$. If there are several such numbers, let $k$ be the smallest one with this property. Obviously,

$$\mathrm{Number}(\alpha_k) \bmod s_j = \mathrm{Number}(\beta_k) \bmod s_j$$

for all $j \in \{1, 2, \dots, d\}$ and so $D_{II}$ cannot reject $(\alpha, \beta)$ in Step 3. If $D_{II}$ sends the binary representation of $k$ to $D_I$, then $D_I$ sends $\alpha_k$ to $D_{II}$ and $D_{II}$ accepts the input. This scenario fails to work only if $D_{II}$ sends $\mathrm{BIN}_{\lceil \log_2 m \rceil}(j)$ to $D_I$ for some $j < k$ despite the fact that $\alpha_j \neq \beta_j$. In that case $D_{II}$ outputs "?". But this can happen only with probability at most

$$2^{-d} \leq 2^{-2 \log_2 m} = \frac{1}{m^2}$$

for every $j \in \{1, \dots, m\}$. So,

$$\mathrm{Prob}(D(\alpha, \beta) = \text{``?''}) \leq m \cdot \frac{1}{m^2} = \frac{1}{m}.$$

Thus,

$$\mathrm{Prob}(D(\alpha, \beta) = \text{``accept''}) \geq 1 - \frac{1}{m}.$$

We conclude that $D$ is a Las Vegas protocol for $\mathrm{ExAll}_{2n}$.

It remains to calculate the communication complexity of $D$. In Step 2, $D_I$ sends $m \cdot d$ binary representations of length $\lceil \log_2 m \rceil$ and so the length of the first message is always exactly $m \cdot 2 \cdot \lceil \log_2 m \rceil^2$. If $D_{II}$ sends a message to $D_I$, then this message has the length $\lceil \log_2 m \rceil$. Then in Step 4 $D_I$ answers with a message of length $m$. So, the longest communication takes

$$2 \cdot m \cdot \lceil \log_2 m \rceil^2 + \lceil \log_2 m \rceil + m \leq 2m(\lceil \log_2 m \rceil^2 + 1)$$

bits. □
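The final inequality can be spelled out in one line. The only fact used, beyond the three message lengths stated above, is that $\lceil \log_2 m \rceil \leq m$ for every positive integer $m$.

```latex
\[
2m\lceil \log_2 m \rceil^{2} + \lceil \log_2 m \rceil + m
  \;\leq\; 2m\lceil \log_2 m \rceil^{2} + m + m
  \;=\; 2m\left(\lceil \log_2 m \rceil^{2} + 1\right).
\]
```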

Observe that Theorem 8 and Theorem 9 together imply that there is an exponential gap between nondeterminism and Las Vegas for communication complexity.

The next question we consider in this section is whether the established relation between $\mathrm{cc}(f)$ and $\mathrm{lvcc}(f)$ is valid also for one-way communication protocols. Surprisingly, there is even a linear relation between $\mathrm{cc}_1(f)$ and $\mathrm{lvcc}_1(f)$, i.e. determinism is almost as powerful as Las Vegas for one-way protocols. We shall show nice applications of this tight relation between $\mathrm{cc}_1(f)$ and $\mathrm{lvcc}_1(f)$ for proving polynomial relations between determinism and Las Vegas for other computing models in Section 7.

Theorem 12 ([22]). For every function $f : U \times V \to \{0,1\}$ with finite sets $U$ and $V$,

$$\mathrm{lvcc}_1(f) \geq \mathrm{cc}_1(f)/2.$$

We have shown that Las Vegas communication complexity is closely related 
to deterministic communication complexity. This is not true for Monte Carlo 
randomization. In this section we show that 

(i) already one-sided-error Monte Carlo protocols may be much more powerful 
than deterministic ones, but the gap between one-sided-error Monte Carlo 
communication complexity and (deterministic) communication complexity 
can be at most exponential, 

(ii) two-sided-error Monte Carlo protocols can be much more efficient than non- 
deterministic ones, and 

(iii) nondeterminism can be much more powerful than Monte Carlo randomiza- 
tion. 

We present the results in the sequence as written above. Observe that the 
power of one-sided-error Monte Carlo randomization lies between determinism 
and nondeterminism. Our first result shows that an exponential gap between de- 
terminism and one-sided-error Monte Carlo is possible for communication com- 
plexity. 

Theorem 13 ([23]). 

(i) For every positive integer $n$, $cc(Ineq_n) = n$, and

(ii) $1mccc(Ineq_n) \le 2 \lceil \log_2 n \rceil$ for sufficiently large integer $n$.

Proof.

(i) We showed already that all three lower bound techniques (Tiling, Rank, Fooling sets) provide $cc(Ineq_n) \ge n$.

(ii) In Example 1 we proved $1mccc(Ineq_n) \le 2 \cdot \lceil \log_2 n \rceil$ for sufficiently large $n$.

□ 

A natural question is whether there is a possibility to bound the difference between $cc(f)$ and $1mccc(f)$ for all functions $f$ (i.e., whether a larger gap than the exponential gap presented in Theorem 13 is possible). The following theorem shows that the gap cannot be larger than exponential because nondeterminism can be simulated by determinism with an exponential blow-up of communication complexity. Since one-sided-error Monte Carlo protocols can be viewed as restricted nondeterministic protocols, we are done.
Theorem 14 ([24]). For every finite function $f : U \times V \to \{0,1\}$,

\[ cc_1(f) \le 2^{ncc(f)}. \]

Proof. We know that $2^{ncc(f)}$ is exactly the cardinality of an optimal cover of the 1's of $M_f$ by 1-monochromatic submatrices of $M_f$. Let $\{M_1, M_2, \ldots, M_{2^{ncc(f)}}\}$ be such an optimal cover. One can construct a (deterministic) one-way protocol $P = (C_I, C_{II})$ that computes $f$ as follows.

Input: A pair $(\alpha, \beta) \in U \times V$.

{ $\alpha$ is the input part of $C_I$, and $\beta$ is the input part of $C_{II}$. }

Step 1. For every input $\alpha \in U$, $C_I$ sends the message

\[ d_1 d_2 \cdots d_{2^{ncc(f)}} \in \{0,1\}^{2^{ncc(f)}} \]

to $C_{II}$, where, for all $i \in \{1, \ldots, 2^{ncc(f)}\}$, $d_i = 1$ if row $\alpha$ has a nonempty intersection with the 1-monochromatic submatrix $M_i$.

{ In this way $C_I$ tells $C_{II}$ the names of all 1-monochromatic submatrices that are candidates to contain the element $(\alpha, \beta)$ from the $C_I$ point of view. }

Step 2. After receiving $d_1 d_2 \cdots d_{2^{ncc(f)}}$, $C_{II}$ accepts if there exists an $i \in \{1, \ldots, 2^{ncc(f)}\}$ such that $d_i = 1$ and column $\beta$ has a nonempty intersection with the 1-monochromatic submatrix $M_i$. Otherwise, $C_{II}$ rejects.

Obviously, $P$ computes $f$ and its communication complexity is exactly $2^{ncc(f)}$. □
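The two steps of the protocol can be sketched in a few lines of Python. This is only an illustration under assumptions: the cover is passed in explicitly as a list of (rows, columns) pairs, and the toy function `f` (1 iff the two inputs share a 1-position) together with its cover `COVER` are hypothetical choices made here, not taken from the text.

```python
def one_way_protocol(cover, alpha, beta):
    """Simulate the one-way protocol built from a cover of the 1's of M_f.

    cover: list of (rows, cols) pairs; each pair describes one
    1-monochromatic submatrix rows x cols (the caller must supply a
    valid cover, this sketch does not check it).
    Returns (accept?, number of bits sent by C_I).
    """
    # Step 1: C_I sends one bit per submatrix: does row alpha meet M_i?
    message = [1 if alpha in rows else 0 for rows, _cols in cover]
    # Step 2: C_II accepts iff a flagged submatrix also meets column beta.
    accept = any(
        bit == 1 and beta in cols
        for bit, (_rows, cols) in zip(message, cover)
    )
    return accept, len(message)


# Toy example on 3-bit inputs (hypothetical): f(a, b) = 1 iff a and b
# share a position holding a 1.
N_BITS = 3
UNIVERSE = range(2 ** N_BITS)


def f(a, b):
    return int((a & b) != 0)


# One 1-monochromatic submatrix per bit position j.
COVER = [
    ({a for a in UNIVERSE if (a >> j) & 1}, {b for b in UNIVERSE if (b >> j) & 1})
    for j in range(N_BITS)
]
```

The message length equals the number of submatrices in the cover, which is why the cost of the simulation is governed by the size of the cover rather than by its logarithm.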

The next result continues to demonstrate the power of Monte Carlo random- 
ization by showing that two-sided-error Monte Carlo protocols may be much 
more efficient than their nondeterministic counterparts. 

Theorem 15 ([23]). For every positive integer $n$,

(i) $ncc(Eq_n) = n$, and

(ii) $2mccc(Eq_n) \le 2 \lceil \log_2 n \rceil$ for sufficiently large $n$.

Proof. 

(i) The fact $ncc(Eq_n) \le n$ is obvious and the lower bound $ncc(Eq_n) \ge n$ was already observed as an application of the 1-fooling set technique and the cover technique in Section 3.

(ii) Consider the one-sided-error Monte Carlo protocol $R$ of Example 1 for $Ineq_n$. If one modifies $R$ to $\overline{R}$ in such a way that $\overline{R}$ accepts iff $R$ rejects, then $\overline{R}$ computes $Eq_n = \overline{Ineq_n}$. This is because, for every input $(\alpha, \beta) \in \{0,1\}^n \times \{0,1\}^n$,

(1) if $\alpha = \beta$ (i.e. $Eq_n(\alpha, \beta) = 1$), then

\[ Prob(\overline{R}(\alpha, \beta) = \text{``accept''}) = 1, \]

(2) if $\alpha \ne \beta$ (i.e. $Eq_n(\alpha, \beta) = 0$), then

\[ Prob(\overline{R}(\alpha, \beta) = \text{``reject''}) \ge 1 - \frac{2 \ln n}{n}. \]







Juraj Hromkovic 



Thus, for sufficiently large $n$, $\overline{R}$ is a two-sided-error Monte Carlo protocol for $Eq_n$. Since $R$ works within $2 \cdot \lceil \log_2 n \rceil$ communication complexity, $\overline{R}$ works with $2 \cdot \lceil \log_2 n \rceil$ communication complexity, too.

□ 
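Example 1 is not reproduced in this section, so the following Python sketch rests on an assumption: that the protocol is the standard prime-fingerprinting test, in which the first computer picks a random prime $p$ and sends $p$ together with its input reduced modulo $p$. The prime range $n^2$ and the function names are illustrative choices, not taken from the text.

```python
import random


def primes_up_to(m):
    """Sieve of Eratosthenes: all primes <= m (requires m >= 2)."""
    sieve = [True] * (m + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(m ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, m + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def equality_protocol(a, b, n, rng=random):
    """One run on n-bit numbers a and b; returns "accept" or "reject".

    Assumed protocol: choose a random prime p <= n*n, send (p, a mod p),
    and accept iff b mod p agrees with the received fingerprint.
    """
    p = rng.choice(primes_up_to(n * n))
    return "accept" if a % p == b % p else "reject"


def error_fraction(a, b, n):
    """Fraction of primes <= n*n on which unequal inputs look equal."""
    primes = primes_up_to(n * n)
    bad = sum(1 for p in primes if a % p == b % p)
    return bad / len(primes)
```

Equal inputs are always accepted. For unequal inputs, a prime is misleading only if it divides $|a - b| < 2^n$, so at most $n - 1$ primes can be misleading, which keeps the error small once the pool of primes is large.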

Observe that Theorem 15 implies that there is an exponential gap between 
two-sided-error Monte Carlo randomization and one-sided-error Monte Carlo 
randomization for communication complexity, too. 

The following result together with Theorem 15 shows that bounded-error 
Monte Carlo randomization and nondeterminism are incomparable. For some 
problems one computing mode can be essentially better than the other one, and 
for another problem the situation may be vice versa. To see it, consider the 
following Boolean function 

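\[ Disj_n(x_1, \ldots, x_n, y_1, \ldots, y_n) : \{0,1\}^n \times \{0,1\}^n \to \{0,1\} \]

defined by

\[ Disj_n(\alpha_1, \alpha_2, \ldots, \alpha_n, \beta_1, \beta_2, \ldots, \beta_n) = \min\Big\{ \sum_{i=1}^{n} \alpha_i \beta_i,\ 1 \Big\}. \]

So, $Disj_n(\alpha_1, \alpha_2, \ldots, \alpha_n, \beta_1, \beta_2, \ldots, \beta_n) = 1$ iff $\sum_{i=1}^{n} \alpha_i \beta_i \ne 0$. In other words, $Disj_n(\alpha_1, \alpha_2, \ldots, \alpha_n, \beta_1, \beta_2, \ldots, \beta_n) = 1$ iff there exists a $j \in \{1, 2, \ldots, n\}$ such that $\alpha_j = \beta_j = 1$.

Theorem 16 ([25,26,27]). For every positive integer $n$

(i) $ncc(Disj_n) \le \lceil \log_2 n \rceil$, and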
(a) 2mccc{Disj 

There are several further papers devoted to randomized protocols (see, for instance, [3,4,28,29,30,31,32,33,34]). Of these, we mention here only that Ablayev [28] gave a nice combinatorial characterization of Boolean functions for which Monte Carlo one-way protocols cannot be much better than one-way deterministic protocols.

5 A Bound on the Number of Random Bits 

The number of random bits used is an important characteristic of every ran- 
domized algorithm. To consider the number of random bits as a complexity 
measure of randomized computation, especially in some tradeoffs with other 
computational complexity measures, is one of the central research tasks on ran- 
domization. There are at least two important general reasons for this. 

(i) The random bits are not for free and the cost of producing a sequence of 
random bits grows essentially with the length of the sequence. 

(ii) If the number of random bits used in a randomized computation is not too 
large, then there is a possibility of an efficient derandomization (i.e., one can 
create an efficient deterministic algorithm doing the same job as the original 
randomized one). 
Besides these two general reasons, we have one more special reason to investigate the number of random bits for communication protocols. As already discussed in Section 2, there are two fundamental possibilities to randomize protocols, namely either by using one common random source (public randomized protocols) or by using two independent random sources (private randomized protocols). The difference between these two randomization approaches is that to simulate public randomized protocols by private ones, it may happen that the private random bits need to be communicated. Thus, any upper bound on the number of random bits is also an upper bound on the possible difference between the communication complexity of public randomized protocols and private randomized protocols.

In this section we present the result of Newman [18], who showed that $O(\log_2 n)$ random bits are enough to exploit the full power of bounded-error (Las Vegas, one-sided-error Monte Carlo, two-sided-error Monte Carlo) protocols.

In what follows, for any finite function $f$, we denote by priv-lvcc$(f)$, priv-1mccc$(f)$, and priv-2mccc$(f)$ the private counterparts of the public randomized communication complexities lvcc$(f)$, 1mccc$(f)$, and 2mccc$(f)$, respectively.

Theorem 17 ([18]). Let $D$ be a public x-randomized protocol for a Boolean function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$, $n \in \mathbb{N}$, where x $\in$ {Las Vegas, one-sided-error Monte Carlo, two-sided-error Monte Carlo}. Then, there exists an equivalent public x-randomized protocol $D'$ for $f$ that uses at most $O(\log_2 n)$ random bits and

\[ cc(D') = O(cc(D)). \]

Proof. Because the error probability of randomized protocols can be decreased to an arbitrarily small constant by a few repetitions of the protocol, we may assume that $D$ computes $f$ with error probability at most $1/6$; the repetitions change the communication complexity by a constant factor only. The idea of the proof is to use some kind of derandomization of $D$ in order to construct $D'$ that computes $f$ with error probability at most $1/3$ and $O(\log_2 n)$ random bits.

Let $\Pi$ be the set of all random sequences used by $D$, and let $Prob$ be the probability distribution over $\Pi$. So, we can write $D = (Prob, \{D_r \mid r \in \Pi\})$, where $D_r$ is the protocol corresponding to $r$. Let $Z(\alpha, \beta, r)$ be a random variable that gets the value 1 if $D$ gives the wrong answer ("?" in the case of Las Vegas protocols) on the input $(\alpha, \beta) \in U \times V$ when choosing the random sequence $r$ (i.e. the (deterministic) protocol $D_r$ corresponding to $r$ gives a wrong answer). Otherwise (if $D_r(\alpha, \beta) = f(\alpha, \beta)$), $Z(\alpha, \beta, r) = 0$. Because $D$ computes $f$ with error at most $1/6$ we have

\[ E_{r \in \Pi}[Z(\alpha, \beta, r)] = \sum_{r \in \Pi} Prob(r) \cdot Z(\alpha, \beta, r) \le \frac{1}{6} \tag{3} \]

for all $(\alpha, \beta) \in U \times V$.

Our aim is to show that there exist $t = 36n$ deterministic protocols $D'_1, D'_2, \ldots, D'_t$ in $\{D_r \mid r \in \Pi\}$ such that $D' = (Pr, \{D'_1, D'_2, \ldots, D'_t\})$ with $Pr(D'_i) = Pr(D'_j) = 1/t$ for all $i, j \in \{1, 2, \ldots, t\}$, computes $f$ with error at most $1/3$. To prove the existence of $D'_1, D'_2, \ldots, D'_t$ we use the probabilistic method.

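Let $D_{r_1}, D_{r_2}, \ldots, D_{r_t}$ be arbitrary $t$ (deterministic) protocols from $\{D_r \mid r \in \Pi\}$. Consider the following randomized protocol $D_{r_1 r_2 \ldots r_t} = (Pr, \{D_{r_1}, \ldots, D_{r_t}\})$ with the uniform probability distribution $Pr$. We use

\[ E_{Pr}[Z(\alpha, \beta, r_i)] = \sum_{i=1}^{t} Pr(D_{r_i}) \cdot Z(\alpha, \beta, r_i) = \sum_{i=1}^{t} \frac{1}{t} \cdot Z(\alpha, \beta, r_i) \]

to denote the expected error of $D_{r_1 r_2 \ldots r_t}$. By the Chernoff inequality and the bound (3), we obtain

\[ Prob\left( \frac{1}{t} \sum_{i=1}^{t} Z(\alpha, \beta, r_i) - \frac{1}{6} \ge \frac{1}{6} \right) \le 2 e^{-2n} < 2^{-2n} \tag{4} \]

for every $(\alpha, \beta) \in U \times V$. The inequality (4) implies that, for a random choice of $r_1, \ldots, r_t$, the probability that $E_{Pr}[Z(\alpha, \beta, r_i)] > \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$ for some input $(\alpha, \beta)$ is smaller than $2^{2n} \cdot 2^{-2n} = 1$. This implies that there exists a choice of $r_1, \ldots, r_t$ where, for every $(\alpha, \beta) \in U \times V$, the error $E_{Pr}[Z(\alpha, \beta, r_i)]$ of the protocol $D_{r_1 r_2 \ldots r_t}$ is at most $1/3$.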

Since the protocol $D'$ is a uniform probability distribution over $t$ deterministic protocols, the number of random bits of $D'$ is

\[ \lceil \log_2 t \rceil = \lceil \log_2 (36n) \rceil \le 5 \cdot \lceil \log_2 n \rceil. \]

The communication complexity of $D'$ is at most the communication complexity of $D$, and so $cc(D') = O(cc(D))$. □
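The sampling argument of the proof can be tried out on a small instance. The Python sketch below is an illustration under assumptions, not the construction of [18] itself: the underlying public-coin protocol is a hypothetical equality test with `k` random parity checks (error $2^{-k}$ on unequal inputs, so $k = 3$ gives error $1/8 \le 1/6$), and the search simply resamples until the worst-case error of the $t = 36n$ chosen protocols is at most $1/3$.

```python
import random
from itertools import product


def make_protocol(n, k, rng):
    """One deterministic protocol D_r: r consists of k random n-bit masks.

    It accepts iff a and b have equal parity on every mask
    (a hypothetical public-coin equality test, used only as an example).
    """
    masks = [rng.getrandbits(n) for _ in range(k)]

    def protocol(a, b):
        return all(
            bin(a & m).count("1") % 2 == bin(b & m).count("1") % 2
            for m in masks
        )

    return protocol


def worst_case_error(protocols, n):
    """Largest fraction of chosen protocols that err on a single input."""
    worst = 0.0
    for a, b in product(range(2 ** n), repeat=2):
        wrong = sum(1 for d in protocols if d(a, b) != (a == b))
        worst = max(worst, wrong / len(protocols))
    return worst


def sample_small_protocol(n, k=3, seed=0):
    """Draw t = 36n protocols at random until their error is at most 1/3."""
    rng = random.Random(seed)
    t = 36 * n
    while True:
        chosen = [make_protocol(n, k, rng) for _ in range(t)]
        if worst_case_error(chosen, n) <= 1 / 3:
            return chosen
```

Choosing one of the returned protocols uniformly at random needs only $\lceil \log_2 (36n) \rceil$ random bits, independently of how many random bits the original protocol used.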

A direct consequence of Theorem 17 is a linear relation between private ran- 
domized communication complexities and the corresponding public randomized 
communication complexities, provided that the communication complexities are 
at least of order log 2 n. This result is formulated in the following theorem. 

Theorem 18 ([18]). For every $x \in \{lv, 1mc, 2mc\}$, and for every Boolean function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$, $n \in \mathbb{N} - \{0\}$,

\[ \text{priv-}xcc(f) = O(xcc(f) + \log_2 n). \]






Abstract. UMDA (the univariate marginal distribution algorithm) was
derived by analyzing the mathematical principles behind recombination. 
Mutation, however, was not considered. The same is true for the FDA 
(factorized distribution algorithm), an extension of the UMDA which 
can cover dependencies between variables. In this paper mutation is 
introduced into these algorithms by a technique called Bayesian prior. 
We derive theoretically an estimate how to choose the Bayesian prior. 
The recommended Bayesian prior turns out to be a good choice in a 
number of experiments. These experiments also indicate that mutation 
increases in many cases the performance of the algorithms and decreases 
the dependence on a good choice of the population size. 

Keywords: Bayesian prior, univariate marginal distribution algorithm, 
mutation, estimation of distribution algorithm 



1 Introduction 

Estimation of Distribution Algorithms (EDAs) were proposed by Mühlenbein and Paass [7] as an extension of genetic algorithms. Instead of performing recombination of strings, EDAs generate new points according to the probability distribution defined by the selected points. The crucial problems are: How to estimate the probability distribution of the selected points and how to generate new points?

In [6] Mühlenbein showed that genetic algorithms can be approximated by an algorithm using univariate marginal distributions only (UMDA). In [9] Mühlenbein and Mahnig extended this algorithm to general distributions. The corresponding algorithm was called the Factorized Distribution Algorithm (FDA). For this algorithm convergence to the set of global optima was proven if Boltzmann selection is used and the size of the population is large enough [12]. Convergence means that the dynamical system defined by the algorithm converges to stable attractors.

UMDA and FDA do not use any kind of stochastic mutation. Variation is achieved by a population of individuals. For each problem there exists a critical population size which is needed for convergence to the optima. The determination of a sufficient size of the population turned out to be difficult. In this paper mutation is introduced. Mutation will increase the number of generations needed to converge, but the algorithms can use a smaller population size.

The outline of the paper is as follows. In Sect. 2, the univariate marginal 
distribution algorithm UMDA is defined. In Sect. 3, the connection between 
mutation and using a prior distribution for the estimates of the probability dis- 
tribution of the UMDA is established. By computing an upper limit on the 
Bayesian prior, we derive a recommendation on how to choose the prior param- 
eter (or equivalently, the mutation rate). Numerical results for the UMDA are 
presented in Sect. 4. 

In Sect. 5, the theory is extended to FDA. The recommended prior param- 
eters are tested in Sect. 6 and finally a comparison is made between the FDA 
without mutation and the FDA with mutation. 

2 The Univariate Marginal Distribution Algorithm 

Let $x = (x_1, \ldots, x_n)$ denote a vector, $x_i \in \{0, 1\}$. We use the following conventions. Capital letters $X_i$ denote variables, small letters $x_i$ assignments. For simplicity we only consider binary variables here. Let a function $f : \{0,1\}^n \to \mathbb{R}_{\ge 0}$ be given. We consider the optimization problem $x_{opt} = \mathrm{argmax}\, f(x)$.

The key idea is to consider a population as a sample of a probability distribution $p(x,t)$. Selection in the analysis is then an operator that calculates the selected distribution $p^s(x,t)$ from the previously generated distribution $p(x,t)$.
In the algorithms it can be implemented by choosing a set of points from the 
previously generated ones, possibly taking some points several times. This se- 
lected set often has cardinality M smaller than N, the population size, but this 
need not be the case. 

Definition 1. Let $p(x,t)$ denote the probability of $x$ in the population at generation $t$. Then $p_i(x_i,t) = \sum_{x,\, X_i = x_i} p(x,t)$ defines the univariate marginal distributions of variable $X_i$.

In order to estimate the probabilities from occurrences in a population, we need to make an assumption on the distribution, otherwise the population size needed to make an accurate estimate would be exponentially high. The simplest probability distribution is $p(x,t) = \prod_{i=1}^{n} p_i(x_i,t)$. It is called Robbins' proportions or linkage equilibrium in population genetics and the mean field approach in physics.

The Univariate Marginal Distribution Algorithm (UMDA) [6] approximates genetic recombination by computing

\[ p(x, t+1) = \prod_{i=1}^{n} p_i^s(x_i, t) \tag{1} \]

directly instead of selecting parents for every variable. Thus the population is kept in linkage equilibrium.
UMDA

STEP 0: Set $t \Leftarrow 1$. Generate $N \gg 0$ individuals randomly.

STEP 1: Select $M \le N$ individuals from the population according to a selection method. Compute the sample marginal frequencies $p_i^s(x_i,t)$ of the selected set.

STEP 2: Generate $N$ new points according to the distribution $p(x, t+1) = \prod_{i=1}^{n} p_i^s(x_i,t)$. Set $t \Leftarrow t + 1$.

STEP 3: If termination criteria are not met, go to STEP 1.
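The four steps translate almost directly into code. The following Python sketch is a minimal illustration, assuming truncation selection, a fixed number of generations as the termination criterion, and OneMax as the test function; the names `umda`, `onemax`, `pop_size` and `tau` are chosen here for readability and are not from the paper.

```python
import random


def onemax(x):
    """Number of ones in the bit string."""
    return sum(x)


def umda(f, n, pop_size, tau, generations, seed=0):
    """Minimal UMDA with truncation selection (illustrative sketch).

    f: fitness function on lists of n bits, tau: truncation threshold.
    Returns the best individual seen and the final marginal frequencies.
    """
    rng = random.Random(seed)
    p = [0.5] * n  # STEP 0: uniform random initial population
    best = None
    for _ in range(generations):
        # STEP 2 (and STEP 0 in the first pass): sample N points from
        # the product of the current univariate marginals.
        pop = [
            [1 if rng.random() < p[i] else 0 for i in range(n)]
            for _ in range(pop_size)
        ]
        # STEP 1: truncation selection of the M = tau * N best points.
        pop.sort(key=f, reverse=True)
        if best is None or f(pop[0]) > f(best):
            best = pop[0]
        selected = pop[: max(1, int(tau * pop_size))]
        m = len(selected)
        # Maximum-likelihood marginal frequencies of the selected set.
        p = [sum(x[i] for x in selected) / m for i in range(n)]
    return best, p
```

A call such as `umda(onemax, n=20, pop_size=200, tau=0.3, generations=30)` returns the best string found together with the marginals; without mutation the marginals typically end up fixed at 0 or 1.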

There are many possibilities to select points according to fitness. In this paper, we consider proportionate selection and truncation selection. Let $\bar{f}(t) = \sum_x p(x,t) f(x)$ be the average fitness.

Definition 2. Proportionate selection changes the probabilities according to

\[ p^s(x,t) = p(x,t) \frac{f(x)}{\bar{f}(t)} \tag{2} \]

Note that in UMDA with proportionate selection, we can also directly adjust the marginal frequencies according to the fitness values.

Truncation selection with parameter $0 < \tau < 1$ selects the $\tau \cdot N$ individuals with highest fitness.

UMDA formally depends on $2n$ parameters, the marginal distributions $p_i(x_i,t)$. We now interpret the average $\bar{f}(t) = \sum_x p(x,t) f(x)$ as a function which depends on $p_i(x_i,t)$ because of the linkage equilibrium assumption. As $p_i(0,t) = 1 - p_i(1,t)$, we eliminate half of the parameters and by setting $p_i(t) := p_i(1,t)$, we write

\[ W(t) = \tilde{W}(p_1(t), \ldots, p_n(t)) := \bar{f}(t) \tag{3} \]

We can now formulate difference equations, describing the dynamic behaviour of $p_i(t)$.



Theorem 3. For an infinite population and proportionate selection UMDA changes the gene frequencies as follows:

\[ p_i(t+1) = p_i^s(t) = p_i(t) + p_i(t)(1 - p_i(t)) \frac{\partial W / \partial p_i}{W(t)} \tag{4} \]

The stable attractors of Wright's equation are at the corners, i.e. $p_i \in \{0,1\}$ for $i = 1, \ldots, n$. In the interior there are only saddle points or local minima where $\mathrm{grad}\, \tilde{W}(p) = 0$. The attractors are local maxima of $f(x)$ according to one bit changes. Wright's equation solves the continuous optimization problem $\mathrm{argmax}\{\tilde{W}(p)\}$ by gradient ascent.

The theorem has been proven in [6]. Equation (4) was already conjectured by Wright [14]. In population genetics it is now called Wright's equation.
3 Mutation and the Bayesian Prior for UMDA

We will introduce mutation into UMDA by a concept called Bayesian prior. There are two possibilities to estimate the probability of "head" of a biased coin. The maximum-likelihood estimate counts the number of occurrences of each case. With $m$ times "head" in $N$ throws, the probability is estimated as $p = m/N$.

The Bayesian approach assumes that the coin is determined by an unknown parameter $\theta$. Starting from an a priori distribution of this parameter, the probability distribution of this parameter is modified by observing coin flips (data) using Bayes' rule. The usual distribution chosen for the binomial experiments is the family of Dirichlet distributions. In practice, this means that the estimated probability becomes $p = (m + r)/(N + 2r)$ and the so-called hyperparameter $r$ has to be chosen in advance [3].

Mutation in genetic algorithms works in the following way: When generating new individuals, with a probability of $\mu$ the generated bit is inverted.

The following theorem relates mutation with the Bayesian prior.

Theorem 4. For binary variables, the expectation value for the probability using a Bayesian prior with parameter $r$ is the same as mutation with mutation rate $\mu = r/(N + 2r)$ and using the maximum likelihood estimate.

This can be proven simply by calculating the probability of generating a 
particular bit for both cases. 
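The calculation can be checked mechanically. The Python sketch below uses exact rational arithmetic to compare the Bayesian estimate $(m + r)/(N + 2r)$ with the maximum-likelihood estimate $m/N$ followed by bit inversion at rate $r/(N + 2r)$; the function names are illustrative and not from the paper.

```python
from fractions import Fraction


def bayes_estimate(m, N, r):
    """Probability of generating a 1 under the Bayesian prior r."""
    return Fraction(m + r) / (N + 2 * r)


def equivalent_mutation_rate(N, r):
    """Mutation rate that Theorem 4 associates with prior r."""
    return Fraction(r) / (N + 2 * r)


def mutated_ml_estimate(m, N, mu):
    """Probability of a 1 when sampling with m/N and flipping at rate mu."""
    p = Fraction(m, N)
    # A 1 results from an unflipped 1 or from a flipped 0.
    return p * (1 - mu) + (1 - p) * mu
```

For every count $m$ between 0 and $N$ the two probabilities agree exactly, since $\frac{m}{N} + \frac{r}{N+2r}\cdot\frac{N - 2m}{N} = \frac{m + r}{N + 2r}$.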



3.1 A Bound for the Bayesian Prior 

A Bayesian prior moves the attractors from the corners into the interior. For $r \to \infty$ there is a unique attractor at $p = 0.5$. Using this prior UMDA is a purely random algorithm. Thus the question arises how to choose the prior parameter $r$ or, equivalently, the mutation rate $\mu$, so that an optimum will be generated with a certain probability. The following observation allows us to find an upper limit.

Let the optimum be $(1, \ldots, 1)$. When the population consists only of this optimum, the optimum would be generated again with probability

\[ P_s := p(X_1 = 1, \ldots, X_n = 1) = \left( \frac{N + r}{N + 2r} \right)^n. \tag{5} \]

In order that the optimum remains an attractor, the probability has to be greater than a constant, say 0.3. For $r = N/n$ we have

\[ \left( \frac{N + r}{N + 2r} \right)^n = \left( 1 - \frac{1}{n + 2} \right)^n > 0.3 \tag{6} \]

as the expression on the right side is monotonically converging towards $e^{-1}$. So for this choice of $r$, the population can get sufficiently close to the optimum. If $r$ increases, the probability of generating the optimum decreases exponentially.
This bound has been derived with an unrealistic model. In reality the attractors are defined by a dynamic equilibrium between selection (driving the population to the corner) and mutation (driving the population to the interior). This problem is investigated next.



3.2 Wright’s Equation with Bayesian Prior 

With mutation, the dynamical equations for proportionate selection have to be extended. When using the Bayesian prior, we get for binary variables

\[ p_i(t+1) = \frac{r}{N + 2r} + \left( 1 - \frac{2r}{N + 2r} \right) p_i^s(t), \tag{7} \]

where $p_i^s(t)$ is given by (4). Combining these equations, we get

Theorem 5. Wright's equation with Bayesian prior instead of maximum likelihood estimates is given by

\[ p_i(t+1) = p_i(t) + p_i(t)(1 - p_i(t)) \frac{\partial W / \partial p_i}{W(t)} + \frac{r}{N + 2r} - \frac{2r}{N + 2r} \left( p_i(t) + p_i(t)(1 - p_i(t)) \frac{\partial W / \partial p_i}{W(t)} \right) \tag{8} \]

The attractors are given by $p_i^* := p_i(t+1) = p_i(t)$. They are moving towards $p_i = 1/2$ with increasing $r$.

Consider the function $f(x) = \mathrm{OneMax}(x) = \sum_{i=1}^{n} x_i$. We will apply (8) to the maximization problem $\max_x \mathrm{OneMax}(x)$. Because the equation is identical for all $p_i$, we have $p_i = p_j =: p$ for all $i, j$. Then we obtain $W = n \cdot p$ and $\partial W / \partial p_i = 1$ (see [11] for details on how to calculate the average fitness and derivatives).

Corollary 6. The attractors of UMDA with proportionate selection for the function OneMax with bit length $n$ and population size $N$ are given by

\[ p^* = \frac{1 + rn/N}{1 + 2rn/N} \tag{9} \]

Choosing $r = N/n$ yields a stable attractor with $p^* = 2/3$ and an expectation value $\bar{f} = \frac{2}{3} n$, while the optimum has value $n$. A simulation run of UMDA with $n = 10$, $N = 1000$ and $r = 100$ had as average marginal frequency in the last generation $p_i = 0.669$. So a large-population simulation behaves similarly to the infinite population model of Corollary 6.

Thus this Bayesian prior is too large. In fact, a much smaller prior has to be chosen. A prior of $r = N/n^2$ (corresponding to a mutation rate of about $1/n^2$) yields a value of $p^* \approx 1 - 1/n$ (for large $n$). In this case, we have $\lim_{n \to \infty} P_s(x_{max}) = e^{-1} > 0.3$ and $\bar{f} = n(n+1)/(n+2) \approx n$.
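The attractor of Corollary 6 can be cross-checked by iterating the infinite-population dynamics numerically. The Python sketch below assumes the OneMax specialisation stated above ($W = n \cdot p$, $\partial W / \partial p_i = 1$) and applies proportionate selection followed by the Bayesian estimate with population size $N$; the function names are illustrative.

```python
def wright_step_onemax(p, n, N, r):
    """One generation for OneMax with all marginals equal to p."""
    # Proportionate selection: p^s = p + p(1-p) * (dW/dp_i) / W, W = n*p.
    p_selected = p + p * (1 - p) / (n * p)
    # Bayesian prior applied to the selected frequencies.
    return (p_selected * N + r) / (N + 2 * r)


def attractor_onemax(n, N, r):
    """Closed-form attractor from equation (9)."""
    return (1 + r * n / N) / (1 + 2 * r * n / N)


def iterate(n, N, r, p0=0.5, steps=500):
    """Iterate the dynamics from p0 and return the final frequency."""
    p = p0
    for _ in range(steps):
        p = wright_step_onemax(p, n, N, r)
    return p
```

With $n = 10$, $N = 1000$ and $r = 100$ the iteration settles at $2/3$, in line with the simulated value 0.669 quoted above; with $r = N/n^2 = 10$ it settles at $(n+1)/(n+2) = 11/12$.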
3.3 Truncation Selection and the Bayesian Prior

Wright's equation has been derived under the assumption of proportionate selection. But proportionate selection is far too weak for optimization. Truncation selection is much more efficient. Unfortunately a general dynamic equation for truncation selection has not yet been found. Again, the difference equations are equal for all $p_i$, so $p := p_i$. Mühlenbein [6] has derived the dynamical equations for the function OneMax as

\[ p(t+1) = p^s(t) = p(t) + \frac{I_\tau}{n} \sqrt{n \cdot p(t)(1 - p(t))} \tag{10} \]

where $I_\tau$, the selection intensity, depends on the truncation threshold $\tau$ [6]. It is defined as $I_\tau = S(t)/\sigma(t)$, the quotient of the selection differential and the standard deviation of the population, where $S := \bar{f}^s(t) - \bar{f}(t)$ and $\bar{f}^s(t)$ is the average of the selected points. Using a normality assumption, $I_\tau$ depends only on $\tau$ [2].

As the Bayesian estimator is used only for the selected points, we have to use $M = \tau N$ instead of $N$ in the formulas. Thus when using the Bayesian estimate after selection, we get

\[ p(t+1) = \frac{p^s(t) M + r}{M + 2r} \tag{11} \]

Using (10) we get:

Theorem 7. For UMDA with truncation selection we obtain for the function OneMax the dynamical equation

\[ p(t+1) = p(t) + \frac{I_\tau}{n} \sqrt{n \cdot p(t)(1 - p(t))} + \frac{r}{M + 2r} \left( 1 - 2p(t) - \frac{2 I_\tau}{n} \sqrt{n \cdot p(t)(1 - p(t))} \right) \tag{12} \]

Setting $p(t+1) = p(t)$, the stable attractor is given by

\[ p^* = \frac{1}{2} + \frac{I_\tau M}{2 \sqrt{I_\tau^2 M^2 + 4 n r^2}} = \frac{1}{2} + \frac{1}{2 \sqrt{1 + 4\rho^2/n}} \tag{13} \]

with $\rho := (r \cdot n)/(I_\tau M)$.

The attractor depends only on $\rho$ and $n$. Comparison with actual simulation runs gave excellent results. We just show two runs in Table 1.
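The attractor is easy to evaluate numerically. The Python sketch below assumes the closed form $p^* = \frac{1}{2} + \frac{1}{2\sqrt{1 + 4\rho^2/n}}$, which follows from setting $p(t+1) = p(t)$ in (10) and (11) and agrees with the theory values listed in Table 1; the function names are illustrative.

```python
import math


def attractor_truncation(rho, n):
    """Stable attractor p* for OneMax under truncation selection."""
    return 0.5 + 0.5 / math.sqrt(1 + 4 * rho ** 2 / n)


def prob_optimum(rho, n):
    """Probability (p*)^n of generating the all-ones optimum."""
    return attractor_truncation(rho, n) ** n


if __name__ == "__main__":
    for n in (32, 64):
        p_star = attractor_truncation(1.0, n)
        print(n, round(p_star, 4), round(prob_optimum(1.0, n), 3))
```

For $\rho = 1$ the attractor values for $n = 32$ and $n = 64$ round to the theory entries 0.9714 and 0.9851 of Table 1, and the probability of generating the optimum stays above 0.3 in both cases.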

Table 1. Comparing the attractor from (13) with a simulation run for OneMax with population size 2000.

        n    Theory   Run      n    Theory   Run
  p*    32   0.9714   0.967    64   0.9851   0.983

An attractor in the interior is good because it reduces the probability of fixation. But we have the constraint that the optimum should be generated with a high probability. In Fig. 1 we show the probability to generate the optimum $P_s := (p^*)^n$ versus $\rho$. A constant value of $\rho$ keeps the probability high enough for arbitrary $n$. Thus we obtain the following rule of thumb:

For truncation selection use a value of $\rho = 1$, corresponding to $r = I_\tau M/n$ or $\mu = 1/[n/I_\tau + 2]$. This leads to the UMDA-M algorithm, or UMDA with mutation.

Fig. 1. Value of $P_s$, the probability to generate the optimum, when varying $\rho$ for different bit lengths using formula (13).

Remark: The algorithm with Bayesian prior can run with a very small population size. Setting $N = \lambda$ and $M = 1$, we obtain the $(1, \lambda)$ evolution strategy. By using the approximation $I_\tau \approx \ln \lambda / \ln 2$, we get $r \approx \ln \lambda / (n \ln 2)$.

4 Numerical Results for UMDA 

The preceding analysis was done using many simplifications. In this section we compare actual simulation results with the theoretical results. In addition we investigate a difficult multi-modal function by simulations.

4.1 Simulation Results for OneMax

Our goal is to show that our rule of thumb is correct for OneMax. $\rho$ depends on $M$, $I_\tau$, $r$ and $n$.
In Fig. 2, we examine the behaviour for a fixed number of bits (100) and for population sizes between 10 and 70 with a truncation threshold of $\tau = 0.3$. Plotted is the number of function evaluations for successful runs, counted until 10% of the population consisted of the optimum, against the prior parameter $r$ scaled by $n/(I_\tau M)$, where $M = \tau N$ is the number of selected points. Only points where the success rate was higher than 90% are shown.




Fig. 2. Function evaluations with varying prior for different population sizes with OneMax and 100 bits. Shown is the number of function evaluations for successful runs (when the success rate was above 90%), against the scaled prior parameter.



When the mutation rate is too high, the number of function evaluations 
increases rapidly, as the probabilities are shifted towards 1/2 too much. 

When the mutation rate decreases, the number of evaluations increases also, 
because it becomes too improbable to flip the remaining bits that are wrong. On 
the far left the success rate got below 90%. 

It is remarkable that for all four population sizes, with $r = I_\tau M/n$, roughly the same number of function evaluations is needed. As this number is the product of the number of generations and the population size, this implies that the number of generations needed decreases at roughly the same rate as the number of individuals increases.

Similar results have been obtained for varying $\tau$ and varying $n$. The simulation results confirm the theoretical result.

To summarize: For OneMax with truncation selection, taking $\rho \approx 1$ is a good choice for a range of population sizes, selection intensities and bit lengths. Despite big changes in the number of generations, the number of function evaluations remains remarkably constant for a given number of bits when the population size and the truncation threshold are varied within a reasonable range.
4.2 Simulation Results for Saw

The definition of Saw can be seen from figure 3. A local search has to cross many 
valleys of depth 2 in order to get to the optimum. Thus Saw is very difficult to 
optimize for any kind of local optimization [8]. In contrast, Saw is surprisingly 
easy to optimize for population based search methods like genetic algorithms and 
UMDA. The reason is that the fitness landscape defined by the average fitness 
W is smooth and has only a single valley shortly before the optimum W = 8 
(see [11] for a discussion). 




Fig. 3. Definition of the Saw function 



In Fig. 4, we can see results of using different prior parameters for this func- 
tion. We varied the size of the problem and set M = ^Jn. This approximation 
was proposed in [10]. The truncation threshold was set to t = 0.25. 




Fig. 4. Function evaluations for several bit lengths and population sizes for the Saw 
function with success rate > 90%; t = 0.25. 
We see that $\rho \approx 1$ is a good choice for this function also.

UMDA has been tested with many other functions also. In all cases $\rho \approx 1$ turns out to be a good recommendation.



5 Bayesian Prior and FDA 

UMDA is an efficient algorithm for many multi-modal functions. But when in- 
teractions between variables have to be considered, it might fail. Thus we have 
extended it to the FDA, the factorized distribution algorithm [11,12]. 

The idea of the FDA is to choose a different factorization for the underlying probability distribution. Instead of $p(x) = \prod_i p_i(x_i)$, we have the more general expression

\[ p(x) = \prod_{i=1}^{n} p(x_i \mid \pi_i) = \prod_{i=1}^{n} p(x_i \mid x_{i_1}, \ldots, x_{i_{k_i - 1}}) \tag{14} \]

where each $\pi_i = (x_{i_1}, \ldots, x_{i_{k_i - 1}})$ is a subset of $(x_1, \ldots, x_n)$. They are chosen in such a way that interactions in the function definition are represented and that a valid probability distribution is generated. This can be done using an additive decomposition of the function [10]. An additive decomposition is a representation $f(x) = \sum_i f_i(x_i, \pi_i)$, where each $f_i$ depends only on the subset $(x_i, \pi_i)$ of the variables.

In the context of Bayesian networks, these factorizations can be represented by a directed acyclic graph where the $\pi_i$ are called the parents of each node $X_i$. For an introduction to modelling probability distributions using Bayesian networks see [13,4].

If the function is additively decomposed and separable (the Si have empty 
intersection), then FDA is equivalent to UMDA with multiple alleles. For pro- 
portionate selection an extension of Wright’s equation can be obtained [10]. This 
equation can be extended to include mutation as before. The dynamical equa- 
tions for other selection methods are very difficult, therefore we use heuristics 
to derive some bounds for the mutation rate. 



5.1 Calculating a Bound for the Prior Parameter for FDA 

For UMDA, we have seen the equivalence between using mutation and using a 
Bayesian prior. There seems to be no obvious reason to prefer one to the other. 
This is different for FDA. The concept of linked variables leads to the need of a 
mutation operator that is in accordance with the structure of the distribution. 

The theory of Bayesian estimates can be extended to the cases of conditional 
probabilities [1]. The chain rule of conditional probabilities says that 



p{xi, ...,Xk)= p{xi) ■p{x2\xi) ■p{xz\xx,X2) ■ ■ ■ p{xk\xi, . . .,Xk-l) (15) 
Using the Bayesian priors, we get the following equation for binary variables $x_i$:

\[ \frac{N(x_1, \ldots, x_k) + r'}{N + 2^k r'} = \frac{N(x_1) + r_1}{N + 2r_1} \cdot \frac{N(x_1, x_2) + r_2}{N(x_1) + 2r_2} \cdot \frac{N(x_1, x_2, x_3) + r_3}{N(x_1, x_2) + 2r_3} \cdots \frac{N(x_1, \ldots, x_k) + r_k}{N(x_1, \ldots, x_{k-1}) + 2r_k} \tag{16} \]

where $N(\cdot) = N \cdot p(\cdot)$ is the number of occurrences in the population and the denominators were chosen in such a way that the estimated joint probabilities and the estimated conditional probabilities each sum to 1.

In order for (16) to hold, the fractions on the right hand side have to cancel each other out. We get the following identities for the parameters:

\[ 2r_i = r_{i-1} \;\Rightarrow\; r_i = 2^{-i+1} r_1 \quad \text{and} \quad r' = r_k = 2^{-k+1} r_1 \tag{17} \]

Thus we get

Lemma 1. When $r$ is the prior for a single binary variable, the prior $r'$ for a factor $p(x_1, \ldots, x_k)$ and the prior $r^*$ for a factor $p(x_k \mid x_1, \ldots, x_{k-1})$ should be

\[ r' = r^* = 2^{-k+1} \cdot r \tag{18} \]
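The cancellation behind the lemma can be verified with exact arithmetic. The Python sketch below assumes the halving rule $r_i = r_1 / 2^{i-1}$ and the joint prior $r' = r_1 / 2^{k-1}$ with denominator $N + 2^k r'$; the counts in the usage note are made-up numbers for illustration only.

```python
from fractions import Fraction


def chain_estimate(counts, N, r1):
    """Product of the Bayesian conditional estimates along the chain.

    counts[i] is the number of individuals matching x_1, ..., x_{i+1};
    the prior is halved at every step: r_1, r_1/2, r_1/4, ...
    """
    result = Fraction(1)
    previous = Fraction(N)
    for i, count in enumerate(counts):
        r_i = Fraction(r1) / 2 ** i
        result *= (count + r_i) / (previous + 2 * r_i)
        previous = Fraction(count)
    return result


def joint_estimate(count, N, k, r1):
    """Bayesian estimate of the joint probability with prior r' = r_1 / 2^(k-1)."""
    r_prime = Fraction(r1) / 2 ** (k - 1)
    return (count + r_prime) / (N + 2 ** k * r_prime)
```

With hypothetical counts $N = 100$, $N(x_1) = 60$, $N(x_1, x_2) = 30$, $N(x_1, x_2, x_3) = 10$ and $r_1 = 4$, both functions give $11/108$: each numerator of the chain cancels against the next denominator.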



We now derive an estimate for $r'$ as before. Let the probability distribution be the product of marginal distributions of $k_i$ variables each.

Rule of thumb: Using $r'_i = 2^{-k_i + 1} \cdot r$ with $r = I_\tau M/n$ is a reasonable choice for the Bayesian prior for truncation selection.

This result can be obtained from the following argument: When all individuals
are equal to the optimum, using (14), the probability of generating the
optimum is approximately

$$\prod_i \frac{M + r'}{M + 2^{k_i} r'} \qquad (19)$$

where $r' = 2^{-k_i+1} r$. If we set r = IrM/n we get



i 



ft=n 

2 = 1 

for r > 0.4. 



n + 2Ir 






71 2ilj- 



( 20 ) 



6 Numerical Results 



As a first simple example for the FDA we use the separable deceptive function 
of order m. It is defined as follows: 



$$\mathrm{Dec}_m(x) = \begin{cases} m - 1 - |x| & 0 \le |x| \le m - 1 \\ m & |x| = m \end{cases} \qquad (21)$$

where $|x| = \sum_i x_i$. The global maximum is isolated at $x = (1, \ldots, 1)$. The deceptive
function of order n is a needle in a haystack problem. This is far too difficult
to optimize for any optimization method. We simplify the optimization problem
by adding l distinct $\mathrm{Dec}_4$-functions to give a fitness function of size $n = l \cdot 4$.
This function is also deceptive for population based search algorithms. The local
optimum $x = (0, \ldots, 0)$ is surrounded by good fitness values, whereas the global
optimum is isolated.

$$\mathrm{Dec}_{(l,4)}(x) = \sum_{i=1}^{l} \mathrm{Dec}_4(x_{4i-3}, x_{4i-2}, \ldots, x_{4i}) \qquad (22)$$

$\mathrm{Dec}_{(l,4)}$ is separable. The corresponding FDA factorization is given by

$$p(x, t) = \prod_{i=1}^{l} p_i(x_{4i-3}, x_{4i-2}, \ldots, x_{4i}, t)$$

FDA is therefore identical to UMDA with 16 states (alleles) at each position, so
the two algorithms behave similarly [9]. In Table 2 the attractors of UMDA
and OneMax are compared to the attractor of FDA and $\mathrm{Dec}_{(l,4)}$. It compares
the average marginal frequencies $p_i(1)$ for OneMax (taken from Table 1) with
the same frequencies for the $\mathrm{Dec}_4$-function.
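The deceptive test function can be written down in a few lines. This is a minimal sketch under the definitions (21) and (22): individuals are assumed to be tuples of 0/1 values, and the function names `dec` and `dec_l4` are illustrative only.

```python
def dec(block):
    """Deceptive function of order m = len(block), as in (21)."""
    m = len(block)
    ones = sum(block)
    return m if ones == m else m - 1 - ones


def dec_l4(x):
    """Sum of l separate Dec_4 blocks, as in (22); len(x) must equal 4 * l."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    return sum(dec(x[i:i + 4]) for i in range(0, len(x), 4))


# Global optimum (all ones) versus the deceptive local optimum (all zeros), l = 2.
print(dec_l4((1,) * 8), dec_l4((0,) * 8))
```

Every block that is one bit short of the optimum scores 0, which is what makes the all-zeros string attractive to a hill climber.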



Table 2. Attractor for OneMax and Dec(i^ 4 ) for r = DM = 1/n together with the 
limiting value when the population consists only of the optimum. 





n     OneMax   Dec(8,4)     n     OneMax   Dec(16,4)    p*
32    0.967    0.965        64    0.983    0.982


As the blocks consist of four binary parameters, the recommendation for the
prior parameter is $r' = 2^{-3} r = r/8$. Setting r = IrM/n we obtain r' = IrM/8n.

All simulations confirm that r' = IrM/8n is not the optimal point, but it is a
reasonably good choice. In our experiments, separable functions posed no
difficulties.

6.1 Using a Prior versus a Larger Population Size 

A Bayesian prior reduces the fixation problem and therefore increases the
probability of finding the optimum. This comes at a price, however: the search
using mutation needs more generations than without mutation, so the number of
function evaluations increases. In order to make a fair comparison, we have to
compare the case with prior to a run without prior, but with a bigger population size.

In Table 3, these two cases were compared for several test functions. The 
population sizes were chosen to give comparable success rates. 

The Kaufman function is derived from Kaufman's n-k model [5]. These
functions consist of n binary variables and n sub-functions $f_{s_i}$ depending on
$x_i$ and k further variables. The function values of the sub-functions are chosen
uniformly at random from the interval [0, 1]:

$$\mathrm{Kaufman}_{n,k}(x) = \sum_{i=1}^{n} f_{s_i}(x_{s_i}), \qquad |s_i| = k + 1, \quad i \in s_i \qquad (23)$$

In our case, the functions $f_{s_i}$ were chosen to depend on $x_i$ and the adjacent
variables $x_{i-1}$ and $x_{i+1}$. For this special case, the global optimum can be
calculated in linear time by dynamic programming. The chain structure leads to
the probability distribution

$$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_2, x_3) \cdots \qquad (24)$$

Experiments have shown that the recommended prior is again a good choice; see
Figure 5.
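The linear-time dynamic program mentioned for the adjacent-neighbourhood case can be sketched as follows. The text does not say how the two ends of the chain are treated, so this sketch assumes (hypothetically) that the missing neighbours of the first and last variable are fixed to 0. The names `make_chain_nk`, `evaluate` and `chain_optimum` are illustrative only.

```python
import random
from itertools import product


def make_chain_nk(n, rng):
    """One random table per variable, indexed by (left, own, right) bits."""
    return [{bits: rng.random() for bits in product((0, 1), repeat=3)}
            for _ in range(n)]


def evaluate(tables, x):
    """Fitness of x; the chain is padded with 0 at both ends (assumption)."""
    padded = (0,) + tuple(x) + (0,)
    return sum(tables[i][padded[i:i + 3]] for i in range(len(x)))


def chain_optimum(tables):
    """Maximum fitness, found in time linear in the number of variables."""
    n = len(tables)
    # State: the last two padded bits seen so far, mapped to the best partial sum.
    best = {(0, b): 0.0 for b in (0, 1)}
    for i in range(n):
        last = (i == n - 1)
        nxt = {}
        for (a, b), value in best.items():
            # The bit after the final variable is the fixed padding bit 0.
            for c in ((0,) if last else (0, 1)):
                candidate = value + tables[i][(a, b, c)]
                if candidate > nxt.get((b, c), float("-inf")):
                    nxt[(b, c)] = candidate
        best = nxt
    return max(best.values())
```

Only four states are kept per position, so the cost is a constant amount of work per variable; a brute-force check over all $2^n$ strings is feasible only for small n.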




Fig. 5. Kaufman's n-k function with k = 2 and adjacent neighbourhood. Only points
with success rate above 90% are plotted.



A more difficult example is the function IsoPeak. It is defined as follows: 



$$\mathrm{IsoPeak}(x) = \sum_{i=2}^{l} \mathrm{Iso}_1(x_i, x_{i+1}) + \mathrm{Iso}_2(x_1, x_2) \qquad (25)$$

with

x        00     01     10     11
Iso_1    l      0      0      l-1
Iso_2    0      0      0      l          (26)

and n = l + 1. This function is very hard to optimize. The global optimum is
(1, 1, ..., 1) with value l(l - 1) + 1. This optimum is triggered by $\mathrm{Iso}_2$. It is very
isolated; the second-best solution is (0, 0, ..., 0) with value l(l - 1). The associated
factorization is

$$p(x) = p(x_1, x_2)\, p(x_3 \mid x_2)\, p(x_4 \mid x_3) \cdots p(x_n \mid x_{n-1})$$
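The IsoPeak function can be sketched in code to check the two optimum values quoted above. This sketch assumes that the sum runs over the l - 1 overlapping pairs $(x_i, x_{i+1})$, $i = 2, \ldots, l$, which is the reading consistent with n = l + 1, with the chain factorization, and with the stated values l(l - 1) + 1 and l(l - 1). The function names are illustrative only.

```python
def iso1(a, b, l):
    """Iso_1 table: 00 -> l, 01 -> 0, 10 -> 0, 11 -> l - 1."""
    return {(0, 0): l, (0, 1): 0, (1, 0): 0, (1, 1): l - 1}[(a, b)]


def iso2(a, b, l):
    """Iso_2 table: only the pair 11 is rewarded, with value l."""
    return l if (a, b) == (1, 1) else 0


def isopeak(x):
    """IsoPeak for a bit tuple of length n = l + 1 (assumed chain reading)."""
    l = len(x) - 1
    chain = sum(iso1(x[i], x[i + 1], l) for i in range(1, l))
    return iso2(x[0], x[1], l) + chain


# For l = 5: the all-ones string scores 21, the all-zeros string scores 20.
print(isopeak((1,) * 6), isopeak((0,) * 6))
```

The two values differ by exactly 1, which illustrates how narrowly the global optimum beats the large basin around the all-zeros string.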
Table 3. Performance comparison with and without prior for suitably chosen
population sizes and truncation selection with τ = 0.3.

                   With prior                      Without prior
Name      n      N     Succ      FE     σ       N     Succ      FE     σ
OneMax    100    30    500/500   764    216     100   491/500   1146   71
Saw       40     40    500/500   474    137     60    475/500   484    63
Dec-4     32     60    500/500   663    404     150   475/500   814    107
Kaufman   30     150   486/500   1279   883     250   484/500   1488   223
IsoPeak   30     900   483/500   4764   707     400   484/500   1996   246


It can be seen in Table 3 that for all functions except the last one, IsoPeak,
the runs with a prior needed fewer function evaluations than those without a prior.

An even more convincing reason for using the prior is given by Table 4. For
the runs from Table 3 without prior, the population size had to be carefully
chosen in order to have a good result. When using the prior, this choice is much
easier. In Table 4, several runs were made for the OneMax function with 100
bits and different population sizes. When using the prior, both success rate and
number of function evaluations were not much affected by choosing a population
size between 20 and 50.



Table 4. The effect of having the "wrong" population for OneMax with 100 bits.

        With prior                      Without prior
N     Succ      FE     σ         N     Succ      FE     σ
20    499/500   801    214       70    405/500   838    54
30    500/500   764    216       100   491/500   1146   71
50    500/500   786    135       130   500/500   1452   77



Without the prior, reducing the population size from 100 to 70 reduced the 
number of successful runs from 491 to 405 out of 500. The number of function 
evaluations decreased in this process, but because the success rate was too low, 
this is not a good choice. 

When the population size was increased from 100 to 130, the success rate
reached the limit of 100%. Because the number of generations for the UMDA
without a prior depends only on the bit length, the number of function
evaluations (product of population size and number of generations) has to rise. Note
that the ratio of function evaluations (1452/1146 ≈ 1.27) is almost the same as
the ratio of population sizes, 1.3.

To summarize: Without prior, choosing too small a population size leads to a
small success rate, while choosing too big a population size is a waste of function
evaluations. Both effects are overcome by using a prior.

7 Conclusions

By using a simple argument, we were able to derive an upper bound on a good
choice of the mutation rate both for the simple product distribution of UMDA
and for the conditional distributions of FDA. Experimentally, it has turned out
that this upper bound is actually a good choice for a range of fitness functions,
bit lengths, population sizes and truncation thresholds. By using mutation,
the number of function evaluations can be reduced for most of the test functions
considered. In addition, by using this method, the dependence on the right choice
of the population size could be drastically reduced.

UMDA with mutation is a robust search method. We recommend using a
population size of 50 and τ = 0.3. It can also be used with very severe selection
(e.g. τ = 0.02), making UMDA-M mainly driven by mutation and similar to
evolution strategies.

Abstract. This paper reports on a simple, decentralized, anytime,
stochastic, soft graph-colouring algorithm. The algorithm is designed to 
quickly reduce the number of colour conflicts in large, sparse graphs in 
a scalable, robust, low-cost manner. The algorithm is experimentally 
evaluated in a framework motivated by its application to resource 
coordination in large, distributed networks. 

Keywords: Constraint optimization, conflict minimization, decentral- 
ized algorithms, anytime algorithms, graph colouring. 



1 Introduction 

Soft graph colouring is an extension of traditional graph colouring in which the 
hard constraint that no two adjacent nodes have the same colour is relaxed into 
a soft constraint: the number of adjacent nodes with the same colour is to be 
minimized (in other words, the number of colour conflicts is to be minimized). 

The soft version is useful when large, distributed graphs must be coloured in 
real-time: the size of the graphs, time constraints, combinatorial complexity and 
communication latency practically ensure that hard colouring will be impossible 
to achieve. Instead, an application is designed to work with graphs that are 
mostly properly coloured, but which may contain some colour conflicts. Such 
applications arise in resource coordination in large, distributed networks; for 
example, the nodes may represent resources that are to be scheduled, the colours 
may represent time slots in a cyclic schedule, and the edges may represent mutual 
exclusion constraints between the resources (see the appendix for an example).

It is assumed that colouring is performed simultaneously with some client
process that requires the graph to be coloured, in an anytime, pipeline fashion:
the colourer continually and incrementally improves the colouring and
continually feeds the current colouring to the client system. Since it is assumed that
higher quality colourings result in better performance by the client system, the
emphasis is on quickly reducing the colour conflicts to an acceptable level rather
than on deliberating at length to construct a proper colouring.

* This work is sponsored in part by DARPA through the 'Autonomous Negotiating
Teams' program under contract F30602-00-C-0014, monitored by the Air Force
Research Laboratory. The views and conclusions contained in this document are
those of the authors and should not be interpreted as representing the official policies,
either expressed or implied, of the Defense Advanced Research Projects Agency or
the U.S. Government.

Moreover, the colouring algorithm must be decentralized to allow it to re- 
spond quickly to changing circumstances: e.g., the graph topology changing as 
a result of hardware fluctuations. Indeed, the algorithm must be tolerant of 
hardware/communication failures. 

Finally, the colouring algorithm must be efficient in its own use of network 
resources: the colourer is bound to require, for example, communication and 
computational resources and the colourer’s use of these resources must not reach 
a level at which it interferes with the real tasks that the client process is supposed 
to be achieving. (Of course, the client process cannot achieve those tasks until 
the colourer has made some headway — a balance must be struck by the colourer 
to quickly coordinate the client process while not obstructing it.) 

In summary, the problem to be addressed is that of reducing the number of 
colour conflicts in large, distributed graphs using a quick, real-time, low-cost, 
robust, decentralized algorithm. 

The remainder of this paper is as follows: soft graph colouring is defined; a 
simple, iterative repair algorithm for soft graph colouring is introduced and some 
flaws identified by experimental evaluation; an improvement is proposed, based 
on an algorithm previously published by another author; potential problems with 
this too are identified and a final variant of the algorithm defined. Then the final 
variant is assessed against the criteria given in the preceding paragraph. 

Soft graph colouring may be viewed as an interesting and challenging problem 
in and of itself. Nevertheless, an appendix is included that outlines the use of 
soft colouring in a resource management problem, which may further illuminate 
some of the evaluation criteria. 

2 Soft Graph Colouring 

Let G be an undirected, irreflexive graph with node set N and non-empty edge
set E. An edge between nodes $u \in N$ and $v \in N$ is denoted by {u, v}. Two nodes
are said to be neighbours (or to be adjacent) iff there is an edge between them.

In this paper, only k-colourings are considered: a k-colouring (k > 1) C of a
graph is an assignment of an integer from $\{0, \ldots, k-1\}$ to each of the graph's nodes; the
colour of node u is denoted by $C_u$.

An edge $\{u, v\} \in E$ is said to be a conflict iff $C_u = C_v$. The unnormalized
degree of conflict γ of a colouring C is the fraction of edges that are conflicts:
$\gamma = |\{\{u, v\} \in E \mid C_u = C_v\}| / |E|$. The unnormalized degree of conflict has range
[0, 1], where 0 corresponds to a proper colouring and 1 corresponds to a colouring
in which every node has the same colour.

The chromatic number χ of a graph is the smallest k for which a proper
k-colouring is possible. In traditional graph colouring, it is usual to consider
k-colourings only for $k \ge \chi$. However, in soft graph colouring, a k-colouring for
any k > 1 makes sense, although if $k < \chi$ then it will not be possible to reduce
γ to 0. In general, the analytical determination of the smallest possible value of
γ for a given $k < \chi$ is not straightforward.

An Experimental Assessment of a Soft Colourer for Sparse Graphs

- Each node i randomly chooses a colour $C_i^0$ from $\{0, \ldots, k-1\}$, collectively
  constructing the initial colouring $C^0$.
- The following synchronized loop is then performed indefinitely, for step s = 1, 2, ...,
  by each node i:
  1. Determine an activation probability $a_i \in [0, 1]$.
  2. Generate a random number $r_i \in [0, 1)$.
  3. Choose a colour for the next step:
     a. If $r_i < a_i$, choose $C_i^s$ such that it minimizes the conflicts with i's neighbours.
     b. If $r_i \ge a_i$, choose $C_i^s = C_i^{s-1}$ (i.e., no change).
  4. If $C_i^s \ne C_i^{s-1}$, inform neighbours.

Fig. 1. Synchronous, stochastic min-conflicts colouring algorithm

If the nodes are randomly k-coloured, the expected value of the unnormalized
degree of conflict is $k^{-1}$. This suggests a normalization that is particularly
useful when colourings using different numbers of colours are to be compared:
the normalized degree of conflict Γ of a k-colouring is defined by $\Gamma = k\gamma$. Using
this metric, a random k-colouring has an expected value of 1. The unqualified
phrase degree of conflict should be taken to refer to the normalized metric.

Note that a random colouring can be produced in a distributed environment 
without communication. A colourer that incurs communication costs can thus 
be required to reduce the normalized degree of conflict below 1 if it is to be 
considered useful. 
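Both conflict metrics follow directly from the definitions. The following is a minimal sketch, assuming the graph is given as a list of node pairs and the colouring as a mapping from node to colour; the function name `degree_of_conflict` is illustrative only.

```python
def degree_of_conflict(edges, colour, k):
    """Return (unnormalized, normalized) degree of conflict of a k-colouring.

    edges  -- iterable of (u, v) pairs, one per undirected edge
    colour -- mapping from node to its colour in 0..k-1
    """
    edges = list(edges)
    conflicts = sum(1 for u, v in edges if colour[u] == colour[v])
    gamma = conflicts / len(edges)   # fraction of conflicting edges
    return gamma, k * gamma          # normalized: a random colouring scores about 1


# A triangle 2-coloured so that exactly one of its three edges is a conflict.
triangle = [(0, 1), (1, 2), (0, 2)]
print(degree_of_conflict(triangle, {0: 0, 1: 0, 2: 1}, k=2))
```

A normalized value below 1 therefore means the colouring is better than what random assignment would achieve on average.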



3 A Decentralized, Synchronous, Stochastic Algorithm 

The objective of a soft graph colourer is to quickly reduce the degree of conflict, 
optimally to 0 when the number of colours is at least the chromatic number. In 
a decentralized environment, computing the degree of conflict at run-time is not 
feasible because the communication and coordination costs are prohibitive. 

However, it is assumed that each node is autonomous and can directly com- 
municate with its neighbours (i.e., the constraint network is a subgraph of the 
communication network). So the general framework for colouring is one in which
each node determines its own colour and collaborates with its immediate neigh- 
bours to reduce conflicts. 

Specifically, if each node informs its neighbours when it changes its colour, 
then each node can compute how many conflicts it currently has with its neigh- 
bours, and strive to reduce this number. If every node manages to achieve zero 
conflicts with its neighbours, then the (global) degree of conflict will also be 
reduced to zero (since there will be no conflicts in the graph). 

This is the basis for the synchronous, decentralized k-colouring algorithm
shown in Figure 1. Each node initially chooses a colour at random. Then the
nodes repeatedly update their colours in synchronized steps.

In each step, each node decides whether or not to activate by comparing a
randomly generated number with some activation probability. If the node ac- 
tivates, then it chooses a colour that minimizes the number of colour conflicts 
that it has with its neighbours based on their colours in the previous step. Those 
nodes that change colour inform their neighbours: all of a node’s operations are 
thus based on information that it has available locally. 
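One synchronized step of this scheme can be simulated centrally as follows. This is a minimal sketch, not the authors' implementation: a single activation probability `a` is assumed for all nodes, ties among minimum-conflict colours are broken uniformly at random, and the names `min_conflict_colour` and `fp_step` are illustrative only.

```python
import random


def min_conflict_colour(node, neighbours, colour, k, rng):
    """Pick a colour that minimizes conflicts with the neighbours' colours."""
    counts = [0] * k
    for nb in neighbours[node]:
        counts[colour[nb]] += 1
    fewest = min(counts)
    return rng.choice([c for c in range(k) if counts[c] == fewest])


def fp_step(neighbours, colour, k, a, rng):
    """One synchronized step: every node decides using the previous colouring."""
    new_colour = dict(colour)
    for node in neighbours:
        if rng.random() < a:  # the node activates
            new_colour[node] = min_conflict_colour(node, neighbours, colour, k, rng)
    # Only nodes whose colour changed would need to message their neighbours.
    changed = [node for node in neighbours if new_colour[node] != colour[node]]
    return new_colour, changed


# Two adjacent nodes that both start with colour 1 and both activate (a = 1.0).
two_nodes = {0: [1], 1: [0]}
print(fp_step(two_nodes, {0: 1, 1: 1}, k=2, a=1.0, rng=random.Random(0)))
```

Because every node reads the old colouring, the two-node example ends with both nodes on colour 0: the conflict is preserved even though each node acted to remove it.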

One danger in this synchronized algorithm is that neighbours can perform 
colour changes simultaneously, so while one node is striving to accommodate its 
neighbours’ choices of colours, those neighbours may also be striving to accom- 
modate its colour choice. Thus, it is possible that by trying to reduce conflicts, 
simultaneously activated neighbours preserve or even introduce conflicts. 

For a trivial example, consider a 2-colouring of the two node graph shown 
in Figure 2. Initially, both nodes have colour 1 (by chance); then both nodes 
activate, and both adopt colour 0, since that was the one colour unused by their 
neighbour(s). 

In such a case, the nodes may be said to be acting incoherently. Incoherence 
could be eliminated by imposing a total order on the nodes so that only one 
changes at any given step. However, such a sequential solution is not scalable, 
and is overly severe: the algorithm, in general, seems to be robust enough to 
tolerate some level of incoherence. 

Fig. 2. Incoherent colouring



What is needed is a method for balancing parallel activity against the prob- 
ability that two neighbours will activate simultaneously. This balance can be 
achieved by adjusting the activation probabilities (see Step 1 of Figure 1). 

Unfortunately, the authors do not know of any analytical way of determining 
optimal activation probabilities a priori. Nevertheless, experiments with a simple 
algorithm, the Fixed Probability colourer (FP), have yielded some interesting 
results and suggest that the general framework is useful. 

3.1 The Fixed Probability Colourer 

In the FP(a) algorithm, the activation probabilities $a_i$ are all set uniformly to the
constant a. Figure 3 shows FP's typical behaviours over 10000 steps for various
values of a (note that the step axis is logarithmic and that the experiments were
abbreviated for a = 0.9). The best values for a, in this case, are around 0.3-0.5;
0.3 was chosen for most of the experiments reported below.

The behaviours are averages over 20 graphs of various sizes (in the range 1000
to 5000 nodes) in which the nodes are arranged in a regular 2-dimensional grid
and the edges occur between nodes that are adjacent along an axis or diagonal
(see Figure 4). The chromatic number of such grids (of large enough size) is 4.

Fig. 3. Effect of activation probability on FP, 2D grids (horizontal axis: step,
logarithmic, 1 to 10000)

Fig. 4. A 2D grid

Experiments yielding similar results were also performed with 3-dimensional 
grids (having edges along axes and diagonals, mean degree 26 and chromatic 
number 8) and with random 10-colourable graphs of mean degree 50. The ran- 
dom graphs were constructed by randomly colouring the nodes with 10 colours 
and then randomly generating the requisite number of edges between randomly 
chosen nodes of different colours (and, of course, finally discarding the
colouring). This method of construction ensures that the graph's chromatic number is
no more than 10, and is likely exactly 10.
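The construction of the random test graphs can be sketched as follows. This is a minimal sketch under assumptions the text does not state: duplicate edges are rejected so that the result is a simple graph, and the hidden colouring is returned only so that the example can be checked (the experiments discard it). The function name `random_colourable_graph` is illustrative only.

```python
import random


def random_colourable_graph(n, k, mean_degree, rng):
    """Random graph on n nodes that is k-colourable by construction."""
    hidden = [rng.randrange(k) for _ in range(n)]   # secret proper colouring
    target_edges = n * mean_degree // 2             # mean degree = 2|E| / n
    edges = set()
    # Assumes enough differently coloured pairs exist, otherwise this loops forever.
    while len(edges) < target_edges:
        u, v = rng.randrange(n), rng.randrange(n)
        if hidden[u] != hidden[v]:                  # also rules out u == v
            edges.add((min(u, v), max(u, v)))
    return edges, hidden


edges, hidden = random_colourable_graph(200, 10, 10, random.Random(1))
print(len(edges))
```

Fixing the mean degree rather than the edge probability keeps the local neighbourhood size the same as the number of nodes grows.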

These random graphs are similar to those used in other reported colouring
experiments, with the difference that typically the edge probability is fixed,
rather than the mean degree. The latter is more convenient for the experiments 
reported here, as the behaviour of the colourer seems to be quite independent 
of the number of nodes when the mean degree is fixed, whereas the mean degree 
varies with the number of nodes when the edge probability is fixed. 

Referring again to Figure 3, for a = 0.9, the colourer performs worse than 
random; i.e., the degree of conflict that it produces is higher than the expected 
value for randomly colouring the nodes. By measuring the number of nodes that 
change in each step, it can be determined that not only is the colouring extremely 
poor, but also that it is constantly changing. Such behaviour — a high degree 
of change but no improvement — may be called thrashing, and it is probably as 
undesirable a behaviour as can be produced, for the colourer not only delivers 
a low quality colouring to its client, it also consumes a large amount of system 
resources (for communication and perhaps computation). 

For a = 0.85, the degree of conflict eventually reduces to a low level (as low
as for smaller values of a). However, there is a significant period (steps 1 to 100,
say) during which the degree of conflict is high relative to that achieved with
smaller a. This is considered undesirable behaviour since during that period, the
colouring is of low quality and will degrade the performance of whatever client
process uses the colouring.

Stephen Fitzpatrick and Lambert Meertens

Fig. 5. Effect of number of colours on FP, 2D grids

For a = 0.1, the degree of conflict reduces smoothly to a low value. This 
is probably acceptable behaviour, but note that the degree of conflict can be 
reduced more quickly to an equally low value by choosing a slightly higher a — 
this would be better behaviour. 

In other words, the objective is to choose an a that quickly reduces the
degree of conflict to a low level. This objective is made more interesting due to
local minima: it can happen that lower values of a cause the colourer to become 
trapped in a local minimum that would be quickly escaped if a were significantly 
higher (due to extra incoherence — this is similar to a downward step in a hill- 
climbing algorithm). Whether or not the possibility of better long-term values 
outweighs the possibility of worse short-term values can really only be decided 
in the context of a specific application that provides concrete quality metrics. 

3.2 Effect of Number of Colours 

A k-colouring problem may be characterized based on the number of colours
k relative to the chromatic number χ of the graph to be coloured: a problem
is over-constrained when $k < \chi$, critically constrained when $k = \chi$, and
under-constrained when $k > \chi$; a problem may also be characterized as loosely
constrained when $k \gg \chi$.

The performance of FP on these types of problem is shown in Figure 5: the 
left plot shows the degree of conflict as the colouring algorithm proceeds, while 
the right plot shows the mean of the degree of conflict achieved after 10000 steps. 

It can immediately be seen that FP's performance is disappointing on loosely
constrained problems: FP performs worse than on critically constrained
problems, producing asymptotic degrees of conflict that are significantly above zero.
Moreover, this asymptotic value initially increases with the number of colours,
leveling off at very high numbers of colours.

This effect can be explained as follows. Consider the step of the FP algorithm
in which a node chooses a colour that minimizes its conflicts with its neighbours.
When the number of colours k is much greater than the chromatic number χ,
the number of colours being used by any given node's neighbours is likely to be
small compared with the number of colours available, so each node will have a large
number of zero-conflict colours from which to choose. It chooses one of these at
random. Thus, it behaves partly like a random colourer and, if all of the nodes
were activating on each step, the expected value of the normalized degree of
conflict would be 1.

A simple probabilistic analysis can account for the activation probability and
gives $\Gamma \approx ka/[(k - \delta)(2 - a)]$, where $\delta \ll k$ is a small correction to account
for the average number of colours used in a neighbourhood. Experimentally,
it is found that Γ is close to the asymptotic value predicted by this formula,
$a/(2 - a)$, for large numbers of colours, as shown in Figure 5 (right).
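The predicted asymptote is easy to evaluate. The sketch below simply codes the formula as read from the text, $\Gamma \approx ka/[(k - \delta)(2 - a)]$; the correction δ is not quantified in the text, so it defaults to 0 here as an assumption, and the function name `predicted_conflict` is illustrative only.

```python
def predicted_conflict(k, a, delta=0.0):
    """Predicted normalized degree of conflict for FP(a) with k colours.

    delta is the small neighbourhood correction; 0 gives the large-k limit.
    """
    return k * a / ((k - delta) * (2 - a))


# For the activation probability a = 0.3 used in most experiments,
# the large-k limit a / (2 - a) is roughly 0.176.
print(round(predicted_conflict(100, 0.3), 3))
```

With δ = 0 the expression reduces to a/(2 - a), so lowering the activation probability lowers the residual conflict level on loosely constrained problems.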

In the next two sections, possible improvements to FP are considered. 



3.3 Deterministic Fixed Probability Colourer 

One way to improve FP’s under-constrained behaviour is to make the choice of 
colour deterministic. For example, step (3a) of the algorithm (Figure 1) can be 
modified to choose the smallest colour that minimizes the number of conflicts; 
denote the resulting algorithm as DFP. 

However, as shown in Figure 6, this simple form of deterministic colour choice 
also has undesirable behaviour when under-constrained: while the convergence 
value of the degree of conflict does reduce to zero, as intended, the degree of 
conflict temporarily rises to extreme values during the first few steps. 

A proposed explanation for these short-term peaks is as follows: after random
initialization with a high number of colours, there will be many sets of
neighbours that have the same smallest, optimal colour. When such sets of neighbours
activate, they will introduce conflicts. (Note that the peak occurs around steps
2 and 3, by which time a majority of the nodes are expected to have activated,
since the activation probability is 0.3.)

Fig. 6. DFP on 2D grids

Fig. 7. DFP vs. FP

This short-term, aberrant behaviour can be ameliorated, if not quite
eliminated, by biasing the nodes' colour choices in a non-uniform manner. For
example, each node can be given a unique integer identity tag (or generate a random
integer at startup). When choosing a colour, it chooses the colour that, primarily,
minimizes the conflicts and, secondarily, minimizes the value of (identity + colour)
modulo the number of colours.
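The identity-biased deterministic choice can be sketched as a two-level ordering. This is a minimal sketch, assuming colours are the integers 0..k-1 and that each node knows the current colours of its neighbours; the function name `dfp_colour` is illustrative only.

```python
def dfp_colour(identity, neighbour_colours, k):
    """Deterministic choice: fewest conflicts first, then (identity + colour) mod k."""
    counts = [0] * k
    for c in neighbour_colours:
        counts[c] += 1
    return min(range(k), key=lambda c: (counts[c], (identity + c) % k))


# Two nodes that see the same neighbourhood but carry different identity tags
# settle on different zero-conflict colours.
print(dfp_colour(0, [0], k=5), dfp_colour(3, [0], k=5))
```

With the plain smallest-colour rule both nodes in the example would pick colour 1; the identity offset spreads their preferences across the zero-conflict colours.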

However, there remains another problem with DFP, one that is common to 
deterministic hill-climbing algorithms: it is much more likely than non-determ- 
inistic algorithms to become stuck in a local minimum. Figure 7 compares DFP 
with FP for small numbers of colours — while their initial rates of reducing the 
degree of conflict are similar, FP continues the reduction long after DFP stabi- 
lizes. (By measuring the number of nodes changing colour, it can be confirmed 
that DFP’s colouring becomes fixed, not just its degree of conflict.) 



3.4 Conservative Fixed Probability Colourer 

The basic characteristic of FP that causes poor performance on under- 
constrained problems is that it continues to change colour even when it has 
no conflicts with its neighbours. In a hill-climbing algorithm, this is not neces- 
sarily a bad characteristic: prematurely fixing colours may cause the algorithm 
to become stuck in a local minimum, as in DFP. 

Nevertheless, FP can be simply modified so that a node will not activate if it 
has no conflicts with its neighbours. Denote the modified algorithm as CFP
(Conservative Fixed Probability). Figure 8 (left) compares the behaviour of
FP(0.3) and CFP(0.3) for numbers of colours ranging from 2 to 5 on graphs 
with chromatic number 4 — their performances are virtually indistinguishable. 
Figure 8 (right) compares the performance for higher numbers of colours — CFP 
performs much better, quickly eliminating all conflicts. 

These experiments suggest that CFP does not suffer from local minima (at 
least no more than FP). In light of its much better performance for under- 
constrained colourings, the rest of this paper will deal only with CFP. 
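The conservative modification amounts to one extra test before activation. The following is a minimal, centrally simulated sketch, not the authors' implementation: it assumes a single activation probability `a`, uniform random tie-breaking among minimum-conflict colours, and the illustrative name `cfp_step`.

```python
import random


def cfp_step(neighbours, colour, k, a, rng):
    """One synchronized CFP step: conflict-free nodes never activate."""
    new_colour = dict(colour)
    for node, nbs in neighbours.items():
        counts = [0] * k
        for nb in nbs:
            counts[colour[nb]] += 1
        if counts[colour[node]] == 0:
            continue  # conservative rule: no conflicts, so keep the colour
        if rng.random() < a:
            fewest = min(counts)
            new_colour[node] = rng.choice(
                [c for c in range(k) if counts[c] == fewest])
    return new_colour


# A properly coloured path stays fixed even with a = 1.0 and spare colours.
path = {0: [1], 1: [0, 2], 2: [1]}
print(cfp_step(path, {0: 0, 1: 1, 2: 0}, k=5, a=1.0, rng=random.Random(0)))
```

Once a proper colouring is reached no node changes again, so no further colour-change messages are generated.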

4 Assessment of CFP 

Figure 9 summarizes the performance of CFP for an activation probability of 
0.3. In this section, some further details of this performance are noted, and then 
the CFP colourer is assessed with respect to the desired characteristics listed 
earlier: cost, scalability and robustness. 

When the number of colours is equal to the chromatic number, 4, the degree 
of conflict is reduced to a low value, 0.03, but not to zero. When the number of 
colours is 5, the degree of conflict is reduced to 6 x 10“®. For higher numbers of 
colours, the degree of conflict is quickly reduced to exactly zero (after about
140 steps for 7 colours, and 40 steps for 8 colours). This is typical of the colourer:
it first achieves proper colourings slightly above the chromatic number.

Fig. 8. CFP vs. FP

When the number of colours used to colour a graph is less than the graph’s 
chromatic number, the degree of conflict cannot be reduced to zero. Nevertheless, 
it is still to be hoped that a colourer will manage to get close to the minimum. 
This presents a problem when assessing the performance of a colourer, since 
typically the minimum is unknown from analytical considerations, and finding 
it using, for example, branch and bound search is not likely to be feasible given 
the graph sizes. 

However, some regular graphs, such as the n-dimensional grids, do yield at 
least lower bounds to analysis. While there is a risk that a colourer’s performance 
on these graphs is not representative of its performance on more general graphs, 
it is the best information available so far. 





Fig. 9. Performance of CFP

Fig. 10. Communication costs of CFP

For the (infinite) 2D grids, the chromatic number is 4. For a 2-colouring,
the minimum degree of conflict possible is exactly 0.5; after 10000 steps, CFP
has reduced the degree of conflict to 0.509 on average. For a 3-colouring, the
minimum degree of conflict is exactly 0.375; CFP achieves a mean of 0.382.

Although preliminary, these results, combined with those for 3D grids and 
random graphs, suggest that CFP performs well on sparse graphs. 



4.1 Communication Costs 

In the absence of concrete cost metrics for particular applications, the primary 
cost metric for a decentralized colourer is the amount of communication it uses. 
This is measured as the fraction of nodes that change colour per step, since each 
such change requires communicating the new colour to each neighbour. 

Figure 10 shows the communication costs incurred by CFP(0.3) to colour 2D 
grids with various numbers of colours. The general pattern for communication 
costs is that the costs are initially large after random initialization while many 
conflicts are resolved. As the colouring stabilizes, the costs reduce. 

When the colouring is over-constrained, the degree of conflict cannot be
reduced to zero and CFP will, in general, continue to try to improve the colouring 
even if it achieves an optimal colouring. (This general rule may be violated by 
2-colourings which can give rise to extremely stable, large-scale regions of proper 
colourings, with conflicts occurring only along their borders. In these cases, CFP 
may reach a stable colouring.) 

Although the rate of colour change remains low, the fact that it does not 
reach zero should be considered a weakness of the CFP colourer — ideally, there 
would be a way to determine that a colouring, although improper, is nevertheless 
(near) optimal, and adapt the colourer’s behaviour accordingly. 

When the number of colours is equal to or just larger than the chromatic num- 
ber, the communication costs settle down to a very low value. When the num- 
ber of colours is significantly greater than the chromatic number, CFP rapidly 
achieves a proper, stable colouring and the communication costs drop to zero. 



4.2 Scalability 

CFP is a minor modification to FP so it should be clear from Figure 1 that CFP 
is scalable in the number of nodes: the per-node, per-step computational, storage 
and communication costs are dependent on the mean degree of the graph and 
the number of colours rather than on the number of nodes. 

Of course, for some types of graph (e.g., complete graphs), the mean degree is 
proportional to the number of nodes. However, such graphs are not likely to arise 
in large sensor networks, since the network itself would likely not be scalable. 

Experimentally, we find that for large, sparse graphs, the performance of the 
colourer (not just its costs) shows no dependence on the number of nodes. This 
justifies the use of averages over different graph sizes in reports of experiments. 

Fig. 11. Communication noise (x-axis: Step)

Fig. 12. Communication loss (x-axis: Step)



4.3 Robustness 

The robustness of CFP was explored in two ways: (1) observing the effect
unreliable communication has on the degree of conflict; (2) observing the effect
a dynamic graph topology has on the degree of conflict.



Unreliable Communication. In a distributed environment, each time a node 
changes colour it sends a message containing the new colour to its neighbours. 
These messages can be subjected to a random process in which they may be: 
corrupted by having the colour field changed randomly (with probability r); 
completely discarded (with probability d); or passed through unchanged. 
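
The random process applied to each message might be simulated as follows. This is a minimal sketch; the text does not say how a corrupted colour is chosen, so drawing it uniformly from all colours is an assumption, as are the function and parameter names.

```python
import random


def transmit(colour, num_colours, r, d, rng=random):
    """Return the colour a neighbour receives, or None if the message is lost.

    Assumption (not stated in the text): a corrupted colour is drawn
    uniformly from all num_colours colours, so it may equal the original.
    """
    if not (0.0 <= r <= 1.0 and 0.0 <= d <= 1.0 and r + d <= 1.0):
        raise ValueError("r and d must be probabilities with r + d <= 1")
    u = rng.random()                       # one draw selects one of three outcomes
    if u < d:
        return None                        # completely discarded
    if u < d + r:
        return rng.randrange(num_colours)  # colour field changed randomly
    return colour                          # passed through unchanged
```

Setting d = 0 and varying r corresponds to the noise-only setting, and setting r = 0 and varying d to the loss-only setting.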

Figure 11 shows the effect communication noise has on CFP(0.3) when com- 
munication loss is kept at zero, and Figure 12 shows the effect of communication 
loss when communication noise is kept at zero. In both cases, small values of 
noise/loss proportionately increase the degree of conflict. For large amounts of 
noise/loss, the increase is significant. 

However, given that the colourer is operating in a distributed environment, 
this is to be expected — in the absence of any application-specific metrics, 
what is important is that the algorithm continues to function under unreliable 
communication rather than catastrophically failing. 



Dynamic Topology. To partially simulate sensor failure and recovery, nodes 
and edges can be removed and reinserted into the graph over time. The process 
is designed so that if the rate/extent of change is small, then it is likely that 
structural aspects of the graph (e.g., the chromatic number) will not be much 
changed (such changes could complicate analysis of the experimental data). 

Figure 13 shows the effect of continuous change at levels of less than 10% of
the nodes per step: there is very little effect. Figure 14 shows a typical response
to intermittent change: 20% of the nodes were affected every 30 steps. The degree
of conflict spikes immediately after each change, but it quickly and robustly
decreases afterwards. Indeed, in this case, the degree of conflict continues to
decrease in spite of the topology changes. However, it is to be expected that at
sufficiently high levels of change and sufficiently short periods between changes,
the colouring would degrade.

Fig. 13. Continuous topology change (x-axis: Step)

Fig. 14. Intermittent topology change (x-axis: Step)

5 Related Work 

The Deterministic Fixed Probability algorithm was, to the authors’ knowledge, 
first published by Fabiunke [3] in the context of general distributed constraint 
satisfaction. He uses a randomization technique to break out of local minima 
and generally achieve proper colourings; however, the cost is that the degree of
conflict remains high initially. In the framework of this paper, that is likely to 
be undesirable. 

Yokoo et al. have published several algorithms for distributed constraint
satisfaction [6]. They are concerned primarily with complete algorithms (i.e.,
algorithms guaranteed to find a feasible solution when one exists, and to terminate
if one does not exist) and thus their algorithms are considerably more complex
and incur considerable overheads to track the search space.

Lemaitre and Verfaillie [4] consider soft constraint optimization in a dis- 
tributed setting, but their algorithm uses a central coordinating agent and seems 
to be sequential and unscalable. They do suggest that their algorithm can be 
parallelized, but do not consider details. 

Aspects of iterative algorithms for hard colouring have been widely studied. 
Two papers that are particularly relevant are by Lewandowski and Condon [5], 
and Culberson [2]. However, hard colouring typically seems to deal with a
time-intensive effort to (properly) colour a single, fixed graph using as few colours as
possible. The ongoing reduction of conflicts in a dynamic graph does not seem 
to be as widely studied. 

6 Conclusions and Future Work



This paper presents a simple framework for studying real-time constraint op- 
timization in the form of soft graph colouring. It presents a set of criteria for 
assessing the performance of a distributed, soft colourer. It modifies an already 
published algorithm to enhance its performance over a wide range of problems 
and presents the results of experimental assessments. 

The soft colourers were found to be scalable, low cost, robust and capable of 
responding in real-time. They should prove useful for resource coordination in 
large networks. 

Although graph colouring is a restricted form of constraint satisfaction, it 
nevertheless may be conjectured that the algorithms discussed in this paper are 
readily extensible to the more general problem. However, it seems unnecessary 
at this stage of research to introduce additional details into the framework: there 
are still many interesting problems to be addressed in the simpler form. 

Some of those problems include: determining optimal activation probabili- 
ties in a local manner; determining the minimum degree of conflict for over- 
constrained colourings; recognizing that an improper colouring is nevertheless 
an optimal colouring; dynamically adjusting the number of colours. Yet more 
problems, such as phase-transition phenomena, have presented themselves in 
initial investigations into colouring denser graphs. Finally, the algorithms can 
be extended to operate asynchronously. 

Appendix

A Motivation from Distributed Resource Management 

The advent of small, simple, battery-powered sensors equipped with low-power, 
unreliable radio communication has stimulated interest in the deployment of 
large, distributed, autonomous networks of perhaps tens of thousands of such 
sensors [1]. 

To accomplish specific tasks (e.g., to track a moving target) several of the 
sensors typically must coordinate their actions (e.g., to simultaneously scan the 
same target to achieve accurate triangulation). The nature of the tasks may be 
such that real-time coordination is required (e.g., there is only a finite period 
over which a given sensor is capable of scanning a given moving target). 

The number of tasks that a single sensor can service at any given time is 
bound to be limited (e.g., a sensor may be able to scan in only one direction at a 
time) so if several tasks impinge upon a small group of sensors (as may happen 
when several targets are close together) it becomes necessary to distribute the 
tasks to avoid overtasking sensors. 

The number of tasks may vary dramatically over a short period. At times, the 
number of tasks is expected to approach or even exceed the theoretical capacity 
of the network, requiring the tasks to be carefully distributed to ensure that 
as many as possible are accomplished. At other times, the number of tasks is 
expected to be low, and the network is required to accomplish its tasks with a 
high degree of success and at low cost. 

The cost may be measured, for example, in terms of the amount of battery 
energy expended or amount of communication required. (Communication may 
be considered to be costly because it may help alert an adversary to the presence 
and location of passive sensors.) The sensors may be required to operate in an 
unfavourable environment: sensors and associated computation nodes may fail 
and revive, and communication may be unreliable. 

Given these considerations, a centralized coordination mechanism is not
considered feasible: it would be a computational and communication bottleneck.
What is required is a decentralized coordination mechanism that is scalable, 
real-time, low-cost and robust, and that can perform well under dynamically 
varying task load. 



A.1 Abstraction: Graph Colouring. In its general form, sensor coordination
can involve constraints on sensors and tasks, cost metrics on resource
consumption, and quality metrics on task accomplishment that are arbitrarily 
complex. In this section, the problem is simplified so that it can be viewed as 
graph colouring. This simplification retains most of the features that make dis- 
tributed coordination an interesting problem (since distributed graph colouring 
is just as computationally challenging), while providing a well-known context. 

To view resource coordination as graph colouring, assume that we can impose
a graph structure on the sensor network so that each task can be accomplished
by a single node within a predetermined time limit, and the constraints between
the sensors give rise to mutual exclusion constraints between the nodes.

For example, a network of simple sensors can be reformulated into a graph in 
which the nodes are virtual sensors, each of which is an aggregate of three real 
sensors capable of triangulating a target. Two virtual sensors are not allowed to 
be active simultaneously if they share a real sensor (under the assumption that 
each real sensor can service only one task at a time). 

Given a graph in which the nodes represent sensors and the edges represent 
mutual exclusion constraints, a proper node colouring of the graph represents a 
feasible schedule for the virtual resources: 

— A colour can be viewed as a time period in a cyclic schedule. For example 
‘colour 3’ in a 10-colouring may mean ‘activate during periods 3-4 seconds, 
13-14 seconds, 23-24 seconds, etc’. 

— If two nodes are adjacent (i.e., connected by an edge) and they have the same 
colour, then the virtual resources represented by the nodes are scheduled for 
simultaneous activation even though they are mutually exclusive. An edge 
joining two nodes of the same colour may thus be considered to be a conflict. 

— A proper colouring, by definition, has no conflicts. Thus, no two mutually 
exclusive virtual resources are active simultaneously, and the colouring rep- 
resents a feasible schedule. 
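
The correspondence between colours, activation periods and conflicts can be made concrete with a small sketch. The function names, the 0-based colour numbering and the dwell time of one second are assumptions chosen so that the '3-4 seconds, 13-14 seconds' example above comes out as stated.

```python
def activation_windows(colour, num_colours, dwell=1.0, cycles=3):
    """Start and end times of the periods in which nodes of this colour are active."""
    if not 0 <= colour < num_colours:
        raise ValueError("colour out of range")
    period = num_colours * dwell          # length of one pass through the cyclic schedule
    return [(c * period + colour * dwell, c * period + (colour + 1) * dwell)
            for c in range(cycles)]


def conflicts(edges, colouring):
    """Edges whose endpoints share a colour, i.e. mutually exclusive
    resources that are scheduled for simultaneous activation."""
    return [(u, v) for u, v in edges if colouring[u] == colouring[v]]
```

Here `activation_windows(3, 10)` yields the windows (3, 4), (13, 14) and (23, 24), and a colouring is proper exactly when `conflicts` returns an empty list.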

Given its operating environment, it is to be expected that the coordination 
mechanism may fail to construct a proper colouring and still respond in real-time. 
A flawed colouring (i.e., one that contains conflicts) is still of use to the sensor 
network: some of the tasks will fail, because a conflict represents an unfulfillable
request for a real sensor to service two tasks at the same time, but some of the 
tasks will succeed. The fewer the conflicts the better. 

It is considered better for the coordination mechanism to deliver a (some- 
what) flawed colouring on time than to wait until it has constructed a proper 
colouring, since this may involve an intolerable delay (given that graph colouring 
is NP-hard, the graphs are large and the coordination mechanism uses
high-latency, unreliable communication).

Moreover, while the sensors are executing the schedule represented by the 
flawed colouring, the coordination mechanism can continue to improve the 
colouring. In other words, the coordination mechanism can operate as an any- 
time process. 

Finally, the number of colours used to colour the graph directly determines 
the length of the cyclic schedule, since each colour represents a fixed period of 
time (determined by how long a sensor needs to scan a target once — the dwell 
time). If the schedule length is too long, then there is a good chance that a 
given target will be able to move entirely through a given sensor’s scanning range
without that sensor becoming active — this is not desirable (the revisit period is 
too high). Thus, the number of colours may be fixed by physical considerations. 

Abstract. Branching programs are a graphical representation of
Boolean functions which are considered as a nonuniform model of
computation in complexity theory and are also used as a data structure
in practice. The talk discusses randomized variants of branching
programs which allow one to study the relative power of deterministic,
nondeterministic, and randomized algorithms in a scenario where space
is the primary resource.

Keywords: Randomized branching program, OBDD, read-k-times, 

1 Introduction

Theoretical computer science has been quite successful in inventing many useful, 
practically relevant formal models of computation. Nevertheless, theory has em- 
barrassingly failed so far in the task of characterizing the relative power of the 
most basic models of computation. The classical “$\mathrm{P} \neq \mathrm{NP}$” problem of complexity
theory, i.e., the question of whether NP-complete problems admit efficient,
polynomial time algorithms, is unsolved now for over 30 years and will perhaps 
remain so for a much longer time. The reason for this is our profound lack of suf- 
ficiently powerful techniques for proving lower bounds on complexity measures 
such as time or space on Turing machines. 

In the meantime, the techniques for designing algorithms, i.e., for proving up- 
per bounds, have become fairly advanced. Randomized algorithms have turned 
out to be a powerful tool in many different areas of application. For some prob- 
lems, randomized algorithms are more efficient than the best known deterministic 
algorithms (this is still true, e.g., for primality testing, for which polynomial time
deterministic algorithms only exist under unproven number theoretical assumptions).
If the error probability of a randomized polynomial time algorithm for a
decision problem can be bounded by a constant smaller than 1/2, then it can be
further decreased by a polynomial number of independent repetitions to be of
the order $2^{-p(n)}$, where $p$ is a polynomial and $n$ is the input size. This is
as good as a deterministic algorithm
for many applications. Hence, it is justified to identify the notion of an efficient
algorithm with a randomized polynomial time algorithm with error probability 
bounded by a constant smaller than 1/2. The complexity class BPP (bounded 
error probabilistic polynomial time) contains all decision problems which can be 
solved by such algorithms. 
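
The effect of independent repetitions can be checked numerically. The following sketch assumes the standard amplification scheme (run the algorithm an odd number t of times and output the majority answer), which the text does not spell out; the function name is hypothetical.

```python
from math import comb


def majority_error(eps, t):
    """Exact probability that the majority of t independent runs is wrong,
    when each run errs independently with probability eps (t must be odd)."""
    if t % 2 == 0:
        raise ValueError("use an odd number of repetitions to avoid ties")
    # The majority is wrong exactly when at least (t + 1) / 2 runs are wrong.
    return sum(comb(t, k) * eps**k * (1 - eps)**(t - k)
               for k in range((t + 1) // 2, t + 1))
```

With eps = 1/4, three repetitions already reduce the error from 0.25 to 0.15625, and the error keeps shrinking exponentially in t, as Hoeffding's inequality predicts.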

K. Steinhofel (Ed.): SAGA 2001, LNCS 2264, pp. 65-71, 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

Given the importance of randomized algorithms, we face some more “modern”
open questions in complexity theory in addition to the classical “$\mathrm{P} \neq \mathrm{NP}$”
problem. It is desirable to know whether randomized algorithms can really be
provably more powerful than deterministic ones. In theoretical terms, this is
the open problem of whether $\mathrm{P} \subsetneq \mathrm{BPP}$. Somewhat surprisingly, there are some
powerful derandomization techniques (see, e.g., Impagliazzo and Wigderson [11],
Wigderson [22], and Andreev et al. [5,6]) which may be seen as a support for
the belief that in fact $\mathrm{P} = \mathrm{BPP}$. Furthermore, it would be important for
practical reasons to know whether NP-complete problems can be efficiently solved
by randomized algorithms or not. In the language of complexity theory, this is
the open problem of whether $\mathrm{NP} \subseteq \mathrm{BPP}$. This is unlikely to be the case, since
Ko [12] has shown that $\mathrm{NP} \subseteq \mathrm{BPP}$ would imply that all classes in the so-called
polynomial time hierarchy are identical to BPP (and most people believe that
this is not true).

To sum up, we do not have definite answers even to the basic questions 
concerning the relationship between the classes P, NP, and BPP, since we lack 
appropriate lower bound techniques for obtaining separation results. One ap- 
proach in this situation is to establish results under yet unproven (but plausible) 
assumptions (we have mentioned examples above). In particular, relativized
results belong to this category. On the other hand, one may also study alternative
models of computation which promise to be easier to handle by combinatorial 
tools than the somewhat “unwieldy” classical Turing machine. 

The latter approach has been quite successful for some nonuniform models
of computation, such as circuits, communication protocols, and branching
programs. A nonuniform model of computation allows one to specify a sequence of
algorithms, one for each input length $n$, for computing a sequence of Boolean
functions $f_n\colon \{0,1\}^n \to \{0,1\}$. Branching programs describe sequential
computations in an especially handy way. Furthermore, complexity classes defined in
terms of other well-known nonuniform models of computation may be equivalently
characterized in terms of branching programs.

Definition 1. A (deterministic) branching program (BP) on the variable set
$\{x_1, \ldots, x_n\}$ is a directed acyclic graph with a single source node and sinks
labeled by one of the constants 0 and 1. Each internal (non-sink) node is labeled by a
variable $x_i$ and has exactly two outgoing edges carrying labels 0 and 1, respectively.
This graph represents a Boolean function $f\colon \{0,1\}^n \to \{0,1\}$ in the following way.
To compute $f(a)$ for some input $a \in \{0,1\}^n$, start at the source node. For a
non-sink node labeled by $x_i$, determine the value of this variable and follow the edge
which is labeled by this value (this is called a “test of variable $x_i$”). Iterate this
until a sink node is reached. The value of $f$ on input $a$ is the value of the reached
sink. For a fixed input $a$, the sequence of nodes visited in this way is uniquely
determined and is called the computation path for $a$. The size of a branching
program $G$ is the number of its nodes. The branching program size of a function
is the minimal size of a branching program representing it.
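
The evaluation rule of Definition 1 translates directly into code. The sketch below is illustrative only: the tuple encoding of nodes, the 0-based variable indices and the example program are assumptions made here.

```python
def evaluate(bp, source, a):
    """Follow the computation path for input a; return (sink value, path).

    bp maps a node name to either ("sink", value) or
    ("test", i, succ0, succ1): test variable x_i (0-based index into a)
    and continue at succ0 if it is 0, at succ1 if it is 1.
    """
    node = source
    path = [node]
    while bp[node][0] == "test":
        _, i, succ0, succ1 = bp[node]
        node = succ1 if a[i] else succ0
        path.append(node)
    return bp[node][1], path


# Hypothetical example: a branching program of size 5 for x_0 XOR x_1.
XOR_BP = {
    "s": ("test", 0, "u0", "u1"),
    "u0": ("test", 1, "zero", "one"),
    "u1": ("test", 1, "one", "zero"),
    "zero": ("sink", 0),
    "one": ("sink", 1),
}
```

For the input a = (1, 0), the computation path visits s, u1 and the 1-sink, so the program outputs 1. The size of this program is `len(XOR_BP)`, i.e. 5 nodes.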

Usually, we consider sequences of branching programs representing sequences
$(f_n)_{n \in \mathbb{N}}$ of Boolean functions, where $f_n\colon \{0,1\}^n \to \{0,1\}$. Somewhat sloppily,
we often talk of functions where we really mean “sequences of functions.”

It is a well-known fact that sequences of functions which can be computed by
branching programs of polynomial size can also be computed within logarithmic
space on the nonuniform variant of Turing machines and vice versa (Cobham [10],
Pudlak and Zak [14]). In terms of complexity classes, this means that the class
of sequences of functions with polynomial branching program size is equal to the
class L/poly. Proving a superpolynomial lower bound on the branching program
size for, e.g., a sequence of functions $(f_n)_{n \in \mathbb{N}}$ whose corresponding language
$L_f := \bigcup_{n \in \mathbb{N}} f_n^{-1}(1)$ is contained in P would in particular establish that
$\mathrm{L} \subsetneq \mathrm{P}$, resolving another main open problem of complexity theory. On the
practical side, this result would imply that no P-complete problem can be efficiently
solved by a parallel algorithm.

So far, no superpolynomial lower bounds on the branching program size of
concrete functions are known. Nevertheless, superpolynomial and exponential
lower bounds could be proven for several restricted variants of branching
programs. One major goal in branching program theory is to extend the available
techniques to more and more general models. Recent years have brought some
astonishing progress along this line. Here, we consider the following restricted
variants of branching programs.

• An OBDD (ordered binary decision diagram) is a branching program which 
is associated with an arbitrarily permuted list of the input variables, called 
the variable order of the OBDD. It is required that the list of variables on 
each path from the source to one of the sinks in the graph is a sublist of the 
variable order. 

• A read-k-times branching program is a branching program where each variable
may appear at most $k$ times on each path from the source to a sink. For
$k = 1$, we call the respective graphs read-once branching programs. OBDDs
are obviously special read-once branching programs.

• A linear-length branching program is a branching program where the number 
of variables on each computation path is required to be linear in the number 
of input variables. Notice that this restriction does not extend to so-called 
inconsistent paths, i.e., paths where the same variable is tested with different
results and which thus cannot be computation paths for any input assignment. 
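
The first two restrictions above can be checked mechanically on small examples. The following sketch enumerates all source-to-sink paths, which takes exponential time in general and is meant only as an illustration; the node encoding and both example programs are assumptions made here.

```python
def source_to_sink_paths(bp, source):
    """Yield, for every path from the source to a sink, the list of variable
    indices tested along it (inconsistent paths are included)."""
    stack = [(source, [])]
    while stack:
        node, tested = stack.pop()
        if bp[node][0] == "sink":
            yield tested
        else:
            _, i, succ0, succ1 = bp[node]
            stack.append((succ0, tested + [i]))
            stack.append((succ1, tested + [i]))


def read_multiplicity(bp, source):
    """Smallest k such that the program is a read-k-times branching program."""
    return max((max((p.count(i) for i in set(p)), default=0)
                for p in source_to_sink_paths(bp, source)), default=0)


def respects_order(bp, source, order):
    """True if the variables on every path form a sublist of `order`
    (the OBDD condition); `order` must list every variable that is tested."""
    position = {var: pos for pos, var in enumerate(order)}
    return all(all(position[x] < position[y] for x, y in zip(p, p[1:]))
               for p in source_to_sink_paths(bp, source))


# Hypothetical examples: nodes are ("sink", value) or ("test", i, succ0, succ1).
XOR_BP = {"s": ("test", 0, "u0", "u1"), "u0": ("test", 1, "zero", "one"),
          "u1": ("test", 1, "one", "zero"), "zero": ("sink", 0), "one": ("sink", 1)}
TWICE_BP = {"s": ("test", 0, "t", "one"), "t": ("test", 0, "zero", "one"),
            "zero": ("sink", 0), "one": ("sink", 1)}
```

`XOR_BP` is an OBDD for the variable order [0, 1] but not for [1, 0], while `TWICE_BP` tests the same variable twice on one path and is therefore read-2-times but not read-once.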

OBDDs are one of the most restricted variants of branching programs, but
have turned out to be extremely useful in practice as a data structure for Boolean
functions. The reason is that they allow many practically relevant functions to be
represented compactly and at the same time can be efficiently manipulated by
algorithms, e.g., for computing the disjunction or conjunction of representations
and for checking satisfiability or equivalence.

Read-once branching programs are the restricted type of branching programs
for which the first exponential lower bounds on the size could be proven (Zak [23],
Wegener [20]) and have been extensively studied since then.

Traditionally, lower bounds for read-k-times models have been investigated as
the first step towards time-space tradeoffs. The length of the paths in a branching
program corresponds to the computation time. In a read-k-times branching
program, this length is obviously bounded by $kn$. The first exponential lower bounds
for read-k-times branching programs for values of $k$ larger than 1 (in fact, for $k$
being allowed to be a logarithmic function of the input size) have been proven
by Borodin, Razborov, and Smolensky [9] and by Okolnishnikova [13].

The latest records in the competition for lower bounds on the size of less
and less restricted variants of branching programs have been achieved for
linear-length branching programs. Beame, Saks, and Thathachar [8], Ajtai [3,4] and
Beame, Saks, Sun, and Vee [7] have proven exponential lower bounds for this
model. Their results also give the first non-trivial time-space tradeoffs for the
general model of branching programs. Furthermore, some of the respective lower
bounds even hold for the randomized variant of linear-length branching
programs, as discussed later on.

For more background information and an extensive survey of results for all 
variants of branching programs, we refer to the monograph of Wegener [21]. 

2 Randomized Branching Programs 

Given the close relationship between the logarithm of the size of branching
programs and the space complexity of nonuniform Turing machines, branching
programs are also a suitable model for comparing the power of determinism,
nondeterminism, and randomness in the nonuniform, space-restricted setting. In 
the following, we introduce randomized and nondeterministic variants of branch- 
ing programs. 

Definition 2. A randomized branching program is a branching program with
the following additional properties: (i) The graph may contain unlabeled nodes
with two unlabeled outgoing edges. (ii) A randomized branching program may
contain three sinks labeled by 0, 1, and “?”. Nodes labeled by variables are called
decision nodes, while the unlabeled nodes are called randomized nodes. An input
$a$ defines a randomized computation path in the graph from the source to a
sink as follows. At decision nodes, the successor is chosen as for deterministic
branching programs. At randomized nodes, each of the two successors is chosen
with probability 1/2, and this decision is independent of all other random
decisions. Let $G(a)$ be the random variable which is equal to $r \in \{0, 1, ?\}$ if an
$r$-sink is reached by the randomized computation path for input $a$. We consider
the following modes of acceptance. We say that $G$ computes a function $f$ defined
on the decision variables

- with bounded (two-sided) error $\varepsilon$, $0 < \varepsilon < 1/2$, if $\Pr\{G(a) = f(a)\} > 1 - \varepsilon$
for all inputs $a$;

- with one-sided error $\varepsilon$, $0 < \varepsilon < 1$, if $\Pr\{G(a) = 1\} > 1 - \varepsilon$ for all $a \in f^{-1}(1)$
and $\Pr\{G(a) = 0\} = 1$ for all $a \in f^{-1}(0)$;

- nondeterministically (or with unbounded one-sided error), if $\Pr\{G(a) = 1\} > 0$
for all $a \in f^{-1}(1)$ and $\Pr\{G(a) = 0\} = 1$ for all $a \in f^{-1}(0)$;

- with zero error and failure probability $\varepsilon$, $0 < \varepsilon < 1$, if $\Pr\{G(a) = \neg f(a)\} = 0$
and $\Pr\{G(a) = ?\} < \varepsilon$ for all inputs $a$.

In the nondeterministic case, randomized branching programs are called
nondeterministic branching programs and unlabeled nodes are called nondeterministic
nodes.
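
For small examples, the distribution of the random variable $G(a)$ from Definition 2 can be computed exactly by pushing probability mass through the graph. The encoding of nodes and the toy program below are assumptions made for illustration; the traversal visits every path, so it is not meant for large programs.

```python
from fractions import Fraction


def output_distribution(bp, source, a):
    """Exact distribution of G(a) over the sink labels 0, 1 and "?".

    Assumed node encoding: ("sink", label), ("test", i, succ0, succ1) for
    decision nodes, ("coin", succ_a, succ_b) for unlabeled randomized nodes.
    """
    dist = {}
    stack = [(source, Fraction(1))]
    while stack:
        node, prob = stack.pop()
        kind = bp[node][0]
        if kind == "sink":
            label = bp[node][1]
            dist[label] = dist.get(label, Fraction(0)) + prob
        elif kind == "test":
            _, i, succ0, succ1 = bp[node]
            stack.append((succ1 if a[i] else succ0, prob))
        else:  # randomized node: each successor is taken with probability 1/2
            _, succ_a, succ_b = bp[node]
            stack.append((succ_a, prob / 2))
            stack.append((succ_b, prob / 2))
    return dist


# Hypothetical toy program: a coin flip either gives up ("?") or tests x_0.
TOY = {"s": ("coin", "q", "t"), "t": ("test", 0, "zero", "one"),
       "q": ("sink", "?"), "zero": ("sink", 0), "one": ("sink", 1)}
```

On every input, `TOY` never outputs the wrong value and reaches the "?"-sink with probability exactly 1/2, so it is a (deliberately trivial) zero-error program for the function $f(x_0) = x_0$.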

Complexity classes for branching programs analogous to P, NP, BPP etc. 
are defined by replacing polynomial time complexity with polynomial branching 
program size in the respective model. Due to the difficulty of proving separation 
results, it is not too surprising that already for the more restricted variants of 
branching programs there are several open problems concerning the relationship 
between the different modes of acceptance. 

3 A Collection of Results 

In the talk, some exemplary results on the complexity of randomized OBDDs,
read-k-times branching programs, and linear-length branching programs are
presented. This is by no means an exhaustive treatment. Instead, the presentation is
intended to give an impression of the strengths and weaknesses of what we have 
been able to do so far for randomized branching programs. For more information, 
we refer to the literature and the monograph [21]. 

First of all, OBDDs turn out to be one of the few models in complexity theory 
for which the relative power of determinism, nondeterminism, and randomness 
can be characterized almost completely. The investigation of the randomized 
variant of OBDDs has been initiated by Ablayev and Karpinski [2] who have 
presented a concrete function for which randomized OBDDs require only poly- 
nomial size, while every deterministic OBDD requires exponential size. Further 
upper bounds for randomized OBDDs have been presented by the author [16]. All 
these upper bounds are based on the so-called fingerprinting technique. Lower 
bounds for randomized OBDDs have been shown by Ablayev [1] and the au- 
thor [17] using results from communication complexity. 
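
As a reminder of what fingerprinting means in general, the sketch below compares two integers by their residues modulo a randomly chosen prime. It illustrates only the generic idea under assumptions made here (the prime range, the function names); it is not the randomized OBDD construction of [2] or [16].

```python
import random


def primes_up_to(m):
    """All primes <= m (sieve of Eratosthenes)."""
    sieve = [True] * (m + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(m ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def fingerprints_agree(x, y, primes, rng=random):
    """Compare x and y via their residues modulo one random prime.
    Equal inputs always agree; unequal inputs may agree by accident."""
    p = rng.choice(primes)
    return x % p == y % p


def collision_probability(x, y, primes):
    """Exact probability that x and y receive the same fingerprint."""
    bad = sum(1 for p in primes if (x - y) % p == 0)
    return bad / len(primes)
```

The error is one-sided: a wrong answer requires the chosen prime to divide x - y, and a nonzero difference has only few prime divisors compared with the number of available primes.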

It is much harder to prove exponential lower bounds already for randomized
and nondeterministic variants of read-once branching programs. The first
exponential lower bounds for randomized read-once and read-k-times branching
programs (for $k$ bounded by a logarithmic function in the input size) have been
achieved by the author in [15,18]. The technique is based on ideas due to Borodin,
Razborov, and Smolensky [9] for the nondeterministic case and proceeds in two
steps. First, one uses the structure of the given randomized read-k-times branching
program to obtain a partition of the input space of the represented function
into “well-structured” subsets, which are suitable variants of the combinatorial
rectangles from communication complexity theory. In the second step, one then
derives a lower bound on the number of rectangles in such a representation by
combinatorial means. Usually, the second step is based on discrepancy lower
bounds obtained by algebraic techniques.

The described proof technique has also been applied by Thathachar [19] to
show that the classes of sequences of functions with polynomial size read-k-times
branching programs form a proper hierarchy with respect to $k$. Furthermore, by
more sophisticated constructions, it is also possible to reduce the structure of a 
linear-length branching program to a combinatorial problem defined in terms of 
an appropriate variant of combinatorial rectangles. This is the idea behind the 
more recent results for linear-length branching programs due to Beame, Saks, 
and Thathachar [8], Ajtai [3,4] and Beame, Saks, Sun, and Vee [7], where the 
latter paper even contains exponential lower bounds for the randomized variant 
of this model. 

Studying the randomized mode of computation for restricted branching pro- 
grams has already provided new insights into the power of randomized algorithms 
which went beyond what we already knew from communication complexity the- 
ory. Nevertheless, the results obtained so far for the more general variants of 
branching programs all work only for rather artificial toy functions. There is still 
much to be done to solve the lower bound problem for real-life functions and to
improve nearly all parameters in the results we have, such as the bounds on
the error probability. Furthermore, lower bounds for general branching programs
still remain elusive.

Abstract. We propose a generic, domain-independent local search
method called adaptive search for solving Constraint Satisfaction
Problems (CSP). We design a new heuristic that takes advantage of
the structure of the problem in terms of constraints and variables and
can guide the search more precisely than a single global cost function
(such as, for instance, the number of violated constraints). We
also use an adaptive memory in the spirit of Tabu Search in order to
prevent stagnation in local minima and loops. This method is generic,
can apply to a large class of constraints (e.g., linear and non-linear
arithmetic constraints, symbolic constraints, etc.) and naturally copes
with over-constrained problems. Preliminary results on some classical
CSP problems show very encouraging performance.

Keywords: Local search, constraint solving, combinatorial optimiza- 
tion, search algorithms. 



1 Introduction 

Heuristic (i.e., non-complete) methods have been used in Combinatorial Optimization
for finding optimal or near-optimal solutions for a few decades, originating
with the pioneering work of Lin on the Traveling Salesman Problem [10].
In the last few years, the interest for the family of Local Search methods for 
solving large combinatorial problems has been revived, and they have attracted 
much attention from both the Operations Research and the Artificial Intelligence 
communities, see for instance the collected papers in [1] and [20], the textbook 
[12] for a general introduction, or (for the French speaking reader) [8] for a 
good survey. Although local search techniques have been associated with basic
hill-climbing or greedy algorithms, this term now encompasses a larger class of
more complex methods, the most well-known instances being simulated annealing,
Tabu search and genetic algorithms, usually referred to as “meta-heuristics”.
They work by iterative improvement over an initial state and are thus anytime 
algorithms well-suited to reactive environments. Consider an optimization problem
with a cost function, which makes it possible to evaluate the quality of a given
configuration (assignment of variables to current values), and a transition function
that defines for each configuration a set of “neighbors”. The basic algorithm
consists in starting from a random configuration, exploring the neighborhood,
selecting an adequate neighbor and then moving to the best candidate. This process
continues until some satisfactory solution is found. To avoid being trapped in
local optima, adequate mechanisms should be introduced, such as the adaptive 
memory of Tabu search, the cooling schedule of simulated annealing or simi- 
lar stochastic mechanisms. Very good results have been achieved by dedicated 
and finely tuned local search methods for many problems such as the Traveling 
Salesman Problem, scheduling, vehicle routing, cutting stock, etc. Indeed, such 
techniques are now the most promising approaches for dealing with very large 
search spaces, when the problem is too big to be solved by complete methods 
such as constraint solving techniques. 

In recent years, the application of local search techniques to constraint
solving has started to raise some interest in the CSP community. Localizer [13,14]
proposed a general language to state different kinds of local search heuristics 
and applied it to both OR and CSP problems, and [18] integrated a constraint 
solving component into a local search method for using constraint propagation in 
order to reduce the size of the neighborhoods. GENET [4] was based on the Min- 
Conflict [16] heuristics, while [17] proposed a Tabu-based local search method 
as a general problem solver but this approach required a binary encoding of 
constraints and was limited to linear inequalities. Very recently, [7] developed 
another Tabu-based local search method for constraint solving. This method, 
developed independently of our adaptive search approach, also used so-called 
“penalties” on constraints that are similar to the notion of “constraint errors” 
that will be described later. It is worth noticing that the first use of such a 
concept is to be found in [2]. 

We propose a new heuristic method called Adaptive Search for solving
Constraint Satisfaction Problems. Our method can be seen as belonging to the GSAT
[22], Walksat [23] and Wsat(OIP) [26] family of local search methods. But the
key idea of our approach is to take into account the structure of the problem
given by the CSP description, and to use in particular variable-based
information to design general meta-heuristics. This makes it possible to cope
naturally with heterogeneous problem descriptions, closer to real-life
applications than purely regular academic problems.

Preliminary results on classical CSP benchmarks such as the simple “N- 
queens” problem or the much harder “magic square” or “all-intervals” problems 
show that the adaptive search method performs very well w.r.t. traditional con- 
straint solving systems. 



2 Adaptive Search 

The input of the method is a problem in CSP format, that is, a set of variables 
with their (finite) domains of possible values and a set of constraints over these 
variables. A constraint is simply a logical relation between several unknowns, 
these unknowns being variables that should take values in some specific domain 
of interest. A constraint thus restricts the degrees of freedom (possible values) 
the unknowns can take; it represents some partial information relating the
objects of interest. Constraint Solving and Programming has proved to be very
successful for Problem Solving and Combinatorial Optimization applications,
by combining the declarativity of a high-level language with the efficiency of
specialized algorithms for constraint solving, sometimes borrowing techniques
from Operations Research and Numerical Analysis [21]. Several efficient
constraint solving systems for finite domain constraints now exist, such as ILOG
Solver [19] on the commercial side or clp(FD) [3] and GNU-Prolog [5] on the
academic/freeware side. Although we will completely depart in adaptive search 
from the classical constraint solving techniques (i.e. Arc-Consistency and its ex- 
tensions), we will take advantage of the formulation of a problem as a CSP. Such 
representation indeed makes it possible to structure the problem in terms of vari- 
ables and constraints and to analyze the current configuration (assignment of 
variables to values in their domains) more carefully than a global cost function 
to be optimized, e.g. the number of constraints that are not satisfied. Accurate 
information can be collected by inspecting constraints (that typically involve 
only a subset of all the problem variables) and combining this information on 
variables (that typically appear in only a subset of all the problem constraints) . 

Our method is not limited to any specific type of constraint, e.g. linear
constraints as in classical linear programming or [26]. However we need, for each
constraint, an error function that will give an indication of how much the
constraint is violated. Consider an n-ary constraint $C(X_1, \ldots, X_n)$ and
domains $D_1, \ldots, D_n$ for variables $\{X_1, \ldots, X_n\}$. An error
function $f_C$ associated to the constraint $C$ is simply a real-valued function
from $D_1 \times \ldots \times D_n$ such that $f_C(X_1, \ldots, X_n)$ has value
zero if $C(X_1, \ldots, X_n)$ is satisfied. We do not impose the value of $f_C$
to be different from zero when the constraint is not satisfied, and this will
indeed not be the case in some of the examples we will describe below. The error
function will in fact be used as a heuristic value to represent the "degree of
satisfaction" of a constraint and thus to check how much a constraint is violated
by a given tuple. For instance the error function associated to an arithmetic
constraint $|X - Y| < C$ will be $\max(0, |X - Y| - C)$. Adaptive search relies
on iterative repair, based on variable and constraint error information, seeking
to reduce the error on the worst variable so far. The basic idea is to compute
the error function of each constraint, then combine for each variable the errors
of all constraints in which it appears, thereby projecting constraint errors on
the involved variables. Finally, the variable with the maximal error will be
chosen as a "culprit" and thus its value will be modified. In this second step we
use the well-known min-conflict heuristics [16] and select the value in the
variable domain that has the best immediate error value, that is, the value for
which the total error in the next configuration is minimal. This is similar to
the steepest ascent heuristics for traditional hill-climbing.
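
The projection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `abs_diff_error`, `project_errors` and `culprit` are hypothetical, the combination operation is assumed to be a plain sum, and the two toy constraints are made up for the example.

```python
from collections import defaultdict

def abs_diff_error(x, y, c):
    """Error of the constraint |x - y| < c, using the formula from the text."""
    return max(0, abs(x - y) - c)

def project_errors(constraints, assignment):
    """Combine, for each variable, the errors of the constraints it appears in.

    The combination operation is assumed here to be a plain sum.
    """
    errors = defaultdict(int)
    for variables, error_fn in constraints:
        err = error_fn(*(assignment[v] for v in variables))
        for v in variables:
            errors[v] += err
    return dict(errors)

def culprit(constraints, assignment):
    """Return the variable carrying the largest projected error."""
    errors = project_errors(constraints, assignment)
    return max(errors, key=errors.get)

# Two toy constraints over three variables (illustrative only).
constraints = [
    (("a", "b"), lambda a, b: abs_diff_error(a, b, 2)),
    (("b", "c"), lambda b, c: abs_diff_error(b, c, 1)),
]
assignment = {"a": 0, "b": 5, "c": 9}
variable_errors = project_errors(constraints, assignment)
worst = culprit(constraints, assignment)
```

Here both constraints have error 3, so the shared variable `b` accumulates the largest error and would be the one whose value is changed next.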

In order to prevent being trapped in local minima, the adaptive search
method also includes an adaptive memory as in Tabu Search: each variable
leading to a local minimum is marked and cannot be chosen for the next few
iterations. A local minimum is a configuration for which none of the neighbors
improves the current configuration. This corresponds in adaptive search to a
variable whose current value is better than all alternative values in its domain.
It is worth noting that, contrary to most Tabu-based methods (e.g. [6] or [7]
for a CSP-oriented framework), we mark variables and not couples
< variable, value >, and that we do not systematically mark variables when
chosen in the current iteration but only when they lead to a local minimum.
Observe however that, as we use the min-conflict heuristics, the method will
never choose the same variable twice in a row.

It is worth noticing that the adaptive search method is thus a generic frame- 
work parametrized by three components : 

— A family of error functions for constraints (one for each type of constraint) 

— An operation to combine for a variable the errors of all constraints in which 
it appears 

— A cost function for evaluating configurations

In general the last component can be derived from the first two. Also,
we could require the combination operation to be associative and commutative.



3 General Algorithm 

Let us first detail the basic loop of the adaptive search algorithm, and then 
present some extra control parameters to tune the search process. 

Input : 

Problem given in CSP form : 

— a set of variables $V = \{V_1, V_2, \ldots, V_n\}$ with associated domains of values

— a set of constraints $C = \{C_1, C_2, \ldots, C_k\}$ with associated error functions

— a combination function to project constraint errors on variables 

— a (positive) cost function to minimize 

Output : 

a sequence of moves (modification of the value of one of the variables) that will
lead to a solution of the CSP (configuration where all constraints are satisfied)
if the CSP is satisfiable or to a partial solution of minimal cost otherwise.

Algorithm : 

Start from a random assignment of variables in V 
Repeat 

1. Compute errors of all constraints in C and combine errors on each vari- 
able by considering for a given variable only the constraints on which it 
appears. 

2. select the variable X (not marked as Tabu) with highest error and evaluate 
costs of possible moves from X 

3. if no improving move exists 

then mark X tabu for a given number of iterations 
else select the best move (min-conflict) and change the value of X accord- 
ingly 
until a solution is found or a maximal number of iterations is reached

Some extra parameters can be introduced in the above framework in order
to control the search, in particular the handling of (partial) restarts. One first
has to specify, for a given problem, the Tabu tenure of each variable, that is,
the number of iterations during which a variable should not be modified once it
is marked due to a local minimum. Then, in order to avoid being trapped with a
large number of Tabu variables and therefore no possible diversification, we
randomly reset a certain number of variables when a given number of variables
are Tabu at the same time. We therefore introduce two other parameters: the
reset limit, i.e. the number of simultaneous Tabu variables to reach in order to
trigger a random reset, and the ratio of variables to reset (reset percentage).
Finally, as in all local search methods, we parametrize the algorithm with a
maximal number of iterations (max iterations). This could be used to perform
early restart, as advocated by [23]. Such a loop will be executed at most
max restart times before the algorithm stops.

This method, although very simple, is nevertheless very efficient at solving
complex combinatorial problems such as classical CSPs, as we will see in the
next section. It is also worth noting that this method has several sources of
stochasticity. First, in the core algorithm, both in the selection of the variable
and in the selection of the value, for breaking ties between equivalent choices
(e.g. choosing between two variables that have the same value for the
combination of their respective constraint errors). Second, in the extra control
parameters that have just been introduced, to be tuned by the user for each
application. For instance if the reset limit (number of simultaneous Tabu
variables) is very low, the algorithm will restart very often, thus enhancing the
stochastic aspects of the method; on the other hand if the reset limit is too
high, the method might show some thrashing behavior and have difficulties in
escaping local minima. Last but not least, when performing a restart, the
algorithm will randomly modify the values of a given percentage of randomly
chosen variables (the reset percentage). Thus a reset percentage of 100 % will
amount to restarting each time from scratch.
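
The basic loop and its control parameters can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: the function name `adaptive_search` and its parameter names are hypothetical, the combination operation and the cost function are both assumed to be plain sums of constraint errors, moves are assumed to be single-value changes, and falling back to all variables when every variable is Tabu is an assumption the text does not cover.

```python
import random

def adaptive_search(domains, constraints, tabu_tenure=2, reset_limit=2,
                    reset_fraction=0.1, max_iterations=10_000, seed=0):
    """Return (configuration, iterations_used).

    domains:     dict mapping each variable name to a list of values
    constraints: list of (variable_names, error_fn) pairs; error_fn returns 0
                 when the constraint is satisfied, a positive number otherwise
    """
    rng = random.Random(seed)
    names = list(domains)
    config = {v: rng.choice(domains[v]) for v in names}
    tabu_until = {v: 0 for v in names}  # v is tabu while iteration <= tabu_until[v]

    def constraint_errors(cfg):
        return [fn(*(cfg[v] for v in vs)) for vs, fn in constraints]

    for iteration in range(1, max_iterations + 1):
        errors = constraint_errors(config)
        cost = sum(errors)  # assumed cost function: sum of constraint errors
        if cost == 0:
            return config, iteration

        # Step 1: project constraint errors on variables (assumed: plain sum).
        var_error = {v: 0 for v in names}
        for (vs, _), err in zip(constraints, errors):
            for v in vs:
                var_error[v] += err

        # Step 2: non-tabu variable with highest error, ties broken at random.
        # Assumption: if every variable is tabu, consider all of them.
        free = [v for v in names if tabu_until[v] < iteration] or names
        top = max(var_error[v] for v in free)
        x = rng.choice([v for v in free if var_error[v] == top])

        # Evaluate every alternative value of x (min-conflict).
        best_cost, best_values = cost, []
        for value in domains[x]:
            if value == config[x]:
                continue
            c = sum(constraint_errors({**config, x: value}))
            if c < best_cost:
                best_cost, best_values = c, [value]
            elif c == best_cost and best_values:
                best_values.append(value)

        # Step 3: move if some value improves the cost, else mark x tabu.
        if best_values:
            config[x] = rng.choice(best_values)
        else:
            tabu_until[x] = iteration + tabu_tenure
            n_tabu = sum(1 for v in names if tabu_until[v] > iteration)
            if n_tabu >= reset_limit:
                k = max(1, round(reset_fraction * len(names)))
                for v in rng.sample(names, k):
                    config[v] = rng.choice(domains[v])
    return config, max_iterations

# Toy problem (illustrative only): x + y = 5 and z = 2 over {0, ..., 5}.
domains = {"x": list(range(6)), "y": list(range(6)), "z": list(range(6))}
constraints = [
    (("x", "y"), lambda x, y: abs(x + y - 5)),
    (("z",), lambda z: abs(z - 2)),
]
solution, iterations = adaptive_search(domains, constraints)
```

On this toy problem each violated constraint can be repaired by a single move, so the loop stops after at most three iterations whatever the random choices; harder problems would exercise the Tabu marking and reset branch.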

4 Examples 

Let us now detail how the adaptive search method performs on some classical
CSP examples. We have tried to choose benchmarks on which other constraint
solving methods have been applied in order to obtain comparison data, but it
is worth noting that all these benchmarks have satisfiable instances. For each
benchmark we give a brief description of the problem and its modeling in the
adaptive search approach. Then, we present performance data averaged over 10
executions, including:

— instance number (i.e. problem size) 

— average, best and worst CPU time 

— total number of iterations (within a single run, on average) 

— number of local minima reached (within a single run, on average) 

— number of performed swaps (within a single run, on average) 

— number of resets (within a single run, on average)

Then we compare those performance results (essentially the execution time)
with other methods among the best-known constraint solving techniques:
constraint programming systems [5,19], a general local search system [13,14] and
Ant-Colony Optimization [24]. We have thus favored academic benchmarks over
randomly generated problems in order to compare with data from the literature.

Obviously, this comparison is preliminary and not complete but it should 
give the reader a rough idea of the potential of the adaptive search approach. 
We intend to make a more exhaustive comparison in the near future. 



4.1 Magic to the Square 

The magic square puzzle consists in placing on an NxN square all the numbers
in $\{1, 2, \ldots, N^2\}$ such that the sums of the numbers in all rows, columns
and the two diagonals are the same. It can therefore be modeled in CSP by
considering $N^2$ variables with initial domains $\{1, 2, \ldots, N^2\}$ together
with linear equation constraints and a global all_different constraint stating
that all variables should have a different value. The constant value that should
be the sum of all rows, columns and the two diagonals can be easily computed to
be $N(N^2 + 1)/2$.

The instance of adaptive search for this problem is defined as follows. The
error function of an equation $X_1 + X_2 + \ldots + X_k = b$ is defined as the
value of $X_1 + X_2 + \ldots + X_k - b$. The combination operation is the
absolute value of the sum of errors (and not the sum of the absolute values,
which would be less informative: errors with the same sign should add up as they
lead to compatible modifications of the variable, but not errors of opposite
signs). The overall cost function is the sum of the absolute values of the errors
of all constraints. The method will start with a random assignment of all $N^2$
numbers in $\{1, 2, \ldots, N^2\}$ to the cells of the NxN square and consider as
possible moves all swaps between two values.

The method can be best described by the following example which depicts
information computed on a 4x4 square:

  Values and                Projections           Costs of next
  constraint errors         on variables          configurations

                 |  -8
  11   7   8  15 |   7         8   2   1   8        39  54  51  33
  16   2   4  12 |   0         4   8  16   9        53  67  61  41
  10   6   5   3 | -10         6  23  21   1        45  57  57  66
   1  14   9  13 |   3         1   2   5   9        77  43  48  41
  ---------------+----
   4  -5  -8   9 |  -3




The table on the left shows the configuration of the magic square at some
iteration (each variable corresponds to a cell of the magic square). Numbers on
the right of rows and diagonals, and below columns, denote the errors of the
corresponding constraints. The total cost is then computed as the sum of the
absolute values of those constraint errors and is equal to 57. The table
immediately on the right shows the combination of constraint errors on each
variable. The cell (3,2) with value 6 in the left table has maximal error (23)
and is thus selected for swapping. We should now score all possible swaps with
other numbers in the square; this is depicted in the table on the right,
containing the cost value of the overall configuration for each swap. The cell
(1,4) with value 15 gives the best next configuration (with cost 33) and is thus
selected to perform a move. The cost of the next configuration will therefore
be 33.
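
The numbers of this 4x4 example can be recomputed with a short Python sketch. The function names are hypothetical and the code only follows the definitions given in the text (signed equation errors, projection as the absolute value of the summed errors, cost as the sum of absolute errors); indices are 0-based, so cell (3,2) of the text is `(2, 1)` here.

```python
N = 4
MAGIC = N * (N * N + 1) // 2  # 34 for a 4x4 square

square = [
    [11, 7, 8, 15],
    [16, 2, 4, 12],
    [10, 6, 5, 3],
    [1, 14, 9, 13],
]

def constraint_errors(sq):
    """Signed errors of the row, column and diagonal equations."""
    rows = [sum(r) - MAGIC for r in sq]
    cols = [sum(sq[i][j] for i in range(N)) - MAGIC for j in range(N)]
    diag = sum(sq[i][i] for i in range(N)) - MAGIC
    anti = sum(sq[i][N - 1 - i] for i in range(N)) - MAGIC
    return rows, cols, diag, anti

def cost(sq):
    """Sum of the absolute values of all constraint errors."""
    rows, cols, diag, anti = constraint_errors(sq)
    return sum(abs(e) for e in rows + cols + [diag, anti])

def projections(sq):
    """Absolute value of the summed errors of the constraints on each cell."""
    rows, cols, diag, anti = constraint_errors(sq)
    proj = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            e = rows[i] + cols[j]
            if i == j:
                e += diag
            if i + j == N - 1:
                e += anti
            proj[i][j] = abs(e)
    return proj

def swap_costs(sq, i, j):
    """Cost of the configuration obtained by swapping cell (i, j) with each cell."""
    table = [[0] * N for _ in range(N)]
    for a in range(N):
        for b in range(N):
            new = [row[:] for row in sq]
            new[i][j], new[a][b] = new[a][b], new[i][j]
            table[a][b] = cost(new)
    return table

proj = projections(square)
cells = [(i, j) for i in range(N) for j in range(N)]
worst_cell = max(cells, key=lambda c: proj[c[0]][c[1]])
costs = swap_costs(square, *worst_cell)
best_partner = min(cells, key=lambda c: costs[c[0]][c[1]])
```

Swapping a cell with itself leaves the configuration unchanged, which is why the entry for the selected cell in the cost table equals the current cost.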



Table 1. Magic square results

problem    time (sec)   time (sec)   time (sec)   nb           nb local   nb        nb
instance   avg of 10    best         worst        iterations   minima     swaps     resets
10x10      0.13         0.03         0.21         6219         2354       3864      1218
20x20      3.41         0.10         7.35         47357        21160      26196     11328
30x30      18.09        0.67         52.51        116917       54919      61998     29601
40x40      58.07        10.13        166.74       216477       102642     113835    56032
50x50      203.42       44.51        648.25       487749       233040     254708    128625



Table 1 details the performances of this algorithm on bigger instances. For a
problem of size NxN the following settings are used: Tabu tenure is equal to N-1
and 10 % of the variables are reset when $N^2/6$ variables are Tabu. The programs
were executed on a Pentium III 733 MHz with 192 MB of memory running Linux
RedHat 7.0.




Fig. 1. Magic square graph 

Figure 1 depicts how CPU times evolve w.r.t. problem size: the dotted line
represents the best execution time, the dashed line the worst one and the solid 
line the average one. 



Table 2. Comparison with Localizer

size     Localizer   Adaptive
16x16    50.82       1.14
20x20    313.2       3.4
30x30    1969        18.09
40x40    8553        58.07
50x50    23158       203.4



Constraint programming systems such as GNU-Prolog or ILOG Solver perform
poorly on this benchmark and cannot solve instances greater than 10x10.
We can nevertheless compare with another local search method: this benchmark
has been attacked by the Localizer system with a Tabu-like meta-heuristics.
Localizer [13,14,15] is a general framework and language for expressing local
search strategies which are then compiled into C++ code. Table 2 compares the
CPU times for both systems (in seconds). Timings for Localizer come from [15]
and have been measured on a PentiumIII-800 and thus on a machine similar to
ours (PentiumIII-733), but it is worth noting however that the method used
in Localizer consists in exploring at each iteration step the whole single-value
exchange neighborhood (of size $n^2$). Our results compare very favorably with
those obtained with the Localizer system, as the adaptive search is about two
orders of magnitude faster. Moreover its performances could certainly be
improved by careful tuning of various parameters (global cost function, Tabu
tenure, reset level and percentage of reset variables, ...) in order to make the
method truly adaptive.



4.2 God Saves the Queens 

This puzzle consists in placing N queens on an NxN chessboard so that no two
queens attack each other. It can be modeled by N variables (that is, one for
each queen) with domains $\{1, 2, \ldots, N\}$ (that is, considering that each
queen should be placed on a different row) and $3 \times N(N-1)/2$ disequation
constraints stating that no pair of queens can ever be on the same column,
up-diagonal or down-diagonal:

$\forall (i, j) \in \{1, 2, \ldots, N\}^2$ s.t. $i \neq j$: $Q_i \neq Q_j$, $Q_i + i \neq Q_j + j$, $Q_i - i \neq Q_j - j$

Observe that this problem can also be encoded with three all-different global
constraints.

We can define the error function for a disequation as follows, in the simplest
way: 0 if the constraint is satisfied and 1 if the constraint is violated. The
combination operation on variables is simply addition, and the overall cost
function is the sum of the costs of all constraints.
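
As an illustration, the 0/1 error function and its additive projection can be written as follows in Python. The function name `queens_errors` is hypothetical, and rows and columns are numbered from 0 for convenience, which does not change which disequations are violated.

```python
def queens_errors(queens):
    """Return (cost, per-variable errors) for a list of column positions.

    queens[i] is the column of the queen placed on row i. Each violated
    disequation counts 1; errors are added up on both variables involved.
    """
    n = len(queens)
    var_errors = [0] * n
    cost = 0
    for i in range(n):
        for j in range(i + 1, n):
            violated = (
                int(queens[i] == queens[j])            # same column
                + int(queens[i] + i == queens[j] + j)  # same up-diagonal
                + int(queens[i] - i == queens[j] - j)  # same down-diagonal
            )
            cost += violated
            var_errors[i] += violated
            var_errors[j] += violated
    return cost, var_errors

solved = queens_errors([1, 3, 0, 2])      # a 4-queens solution
diagonal = queens_errors([0, 1, 2, 3])    # all queens on the main diagonal
```

With all four queens on one diagonal, each of the six pairs violates exactly one disequation, so every variable receives the same error and the culprit is chosen by random tie-breaking.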



Table 3. N-Queens results

problem    time (sec)   time (sec)   time (sec)   nb           nb local   nb        nb
instance   avg of 10    best         worst        iterations   minima     swaps     resets
100        0.00         0.00         0.01         30           0          29        0
200        0.01         0.00         0.01         50           0          50        0
500        0.04         0.04         0.05         114          0          114       0
1000       0.14         0.13         0.15         211          0          211       0
2000       0.53         0.50         0.54         402          0          402       0
3000       1.16         1.13         1.19         592          0          592       0
4000       2.05         2.01         2.08         785          0          785       0
5000       3.24         3.19         3.28         968          0          968       0
7000       6.81         6.74         6.98         1356         0          1356      0
10000      13.96        13.81        14.38        1913         0          1913      0
20000      82.47        81.40        83.81        3796         0          3796      0
30000      220.08       218.18       221.14       5670         0          5670      0
40000      441.93       439.54       444.84       7571         0          7571      0
100000     3369.89      3356.75      3395.45      18846        0          18846     0



Table 3 details the performances of this algorithm on large instances. For a
problem of size NxN the following settings are used: Tabu tenure is equal to 2
and 10 % of the variables are reset when N/5 variables are Tabu. The programs
were executed on a Pentium III 733 MHz with 192 MB of memory running Linux
RedHat 7.0.

Figure 2 depicts how CPU times evolve w.r.t. problem size: the dotted line 
represents the best execution time, the dashed line the worst one and the solid 
line the average one. 

Surprisingly, the behavior of adaptive search is almost linear and the
variance between different executions is almost nonexistent. Let us now compare
with a constraint programming system (ILOG Solver) and an ant colony
optimization method (Ant-P solver); both timings (in seconds) are taken from [24]
and divided by a factor 7 corresponding to the SPECint 95 ratio between the
processors. Timings for Localizer come again from [15] and have been measured
on a PentiumIII-800 and thus on a machine slightly faster than ours.
Table 4 clearly shows that adaptive search is much faster on this
benchmark, which might not be very representative of real-life applications but
is a not-to-be-missed CSP favorite...



Table 4. Comparison with ILOG Solver, Ant-P and Localizer

size      ILOG     Ant-P    Localizer   Adaptive
50        0.09     0.39                 0.00
100       0.07     2.58                 0.00
150       79.3     20.6                 0.00
200       36.6     40.37                0.01
256       *        *        0.16        0.01
512       *        *        0.43        0.04
1024      *        *        1.39        0.15
2048      *        *        5.04        0.54
4096      *        *        18.58       2.14
8192      *        *        68.58       9.01
16384     *        *        260.8       46.29
32768     *        *        1001        270.2
65536     *        *        10096       1320



4.3 All-Intervals Series

Although looking like a pure combinatorial search problem, this benchmark is
in fact a well-known exercise in music composition [25]. The idea is to compose
a sequence of N notes such that all notes are different and tonal intervals
between consecutive notes are also distinct. This problem can be modeled as a
permutation of the first N integers such that the absolute differences between
consecutive numbers are all different.

This problem is modeled by considering N variables $\{V_1, \ldots, V_n\}$ that
represent the notes, whose values will represent a permutation of
$\{0, \ldots, N-1\}$. There is only one constraint to encode, stating that the
absolute differences between each pair of consecutive variables are all
different. Possible moves from one configuration consist in all possible swaps
between the values of two variables. As all variables appear symmetrically in
this constraint there is no need to project errors on each variable (all
variable errors would be equal) and we just have to compute the total cost for
each configuration. One way to do this is to first compute the distance between
2 consecutive variables:

$D_i = |V_{i+1} - V_i|$ for $i \in [1, n-1]$

Then one has to define the number of occurrences of each distance value:

$Occ_j = \sum_{i=1}^{n-1} (\text{if } D_i = j \text{ then } 1 \text{ else } 0)$

Obviously, the all-different constraint on the distance values is satisfied iff
for all $j \in [1, n-1]$, $Occ_j = 1$. It is thus interesting to focus on the
values $j$ such that $Occ_j = 0$, representing the "missing values" for the
distances. We will moreover consider that it is harder to place bigger distances
and thus introduce a bias in the total cost as follows:

$cost = \sum_{j=1}^{n-1} (\text{if } Occ_j = 0 \text{ then } j \text{ else } 0)$

Obviously a solution is found when cost = 0.
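
The biased cost defined for the all-intervals series is easy to compute directly. The following Python sketch uses a hypothetical function name, `all_intervals_cost`, and assumes the input is a permutation of 0..N-1 so that every distance lies between 1 and N-1.

```python
def all_intervals_cost(series):
    """Sum of the missing distances j, i.e. those with Occ_j = 0."""
    n = len(series)
    occurrences = [0] * n  # occurrences[j] counts the distances equal to j
    for i in range(n - 1):
        occurrences[abs(series[i + 1] - series[i])] += 1
    return sum(j for j in range(1, n) if occurrences[j] == 0)

zigzag_cost = all_intervals_cost([0, 7, 1, 6, 2, 5, 3, 4])    # distances 7, 6, ..., 1
identity_cost = all_intervals_cost([0, 1, 2, 3, 4, 5, 6, 7])  # every distance is 1
```

For the identity permutation of size 8 the distances 2 to 7 are all missing, so the cost is 2 + 3 + 4 + 5 + 6 + 7 = 27, with the largest missing distances weighing most.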

Table 5 details the performances of this algorithm on several instances. For
a problem of size N the following settings are used: Tabu tenure is equal to
$N/10$ and 10 % of the variables are reset when 1 variable is Tabu. The programs
were executed on a Pentium III 733 MHz with 192 MB of memory running Linux
RedHat 7.0.



Table 5. All-intervals series result

problem    time (sec)   time (sec)   time (sec)   nb           nb local   nb        nb
instance   avg of 10    best         worst        iterations   minima     swaps     resets
10         0.00         0.00         0.00         14           6          8         3
12         0.00         0.00         0.01         46           20         25        20
14         0.00         0.00         0.01         85           38         46        38
16         0.01         0.00         0.02         191          88         103       88
18         0.04         0.01         0.08         684          316        367       316
20         0.06         0.00         0.17         721          318        403       320
22         0.16         0.04         0.36         1519         527        992       791
24         0.70         0.10         2.42         5278         1816       3461      2724
26         3.52         0.36         9.26         21530        7335       14194     11003
28         10.61        1.38         25.00        53065        18004      35061     27006
30         63.52        9.49         174.79       268041       89518      178523    134308

Figure 3 depicts how CPU times evolve w.r.t. problem size: the dotted line
represents the best execution time, the dashed line the worst one and the solid
line the average one.




Let us now compare with a constraint programming system (ILOG Solver)
and an ant colony optimization method (Ant-P solver); both timings are taken
from [24] and divided by a factor 7 corresponding to the SPECint 95 ratio
between the processors. ILOG Solver might take advantage of global constraints
to model this problem, but nevertheless performs poorly and can only find
(without any backtracking) the trivial solution:

$\langle 0, N-1, 1, N-2, 2, N-3, \ldots \rangle$

For instances greater than 16, no other solution can be found in reasonable time:
[24] reported that the execution times were greater than a full hour of CPU
time (this is depicted by a * symbol in our table).

Adaptive search is therefore more than an order of magnitude faster than
Ant-Colony Optimization on this problem (see Table 6, where timings are given
in seconds).

Table 6. Comparison with ILOG Solver and Ant-P

size     ILOG      Ant-P    Adaptive
14       4.18      0.07     0.00
16       126.05    0.28     0.01
18       *         0.52     0.04
20       *         1.48     0.06
22       *         2.94     0.16
24       *         9.28     0.70

4.4 Number Partitioning

This problem consists in finding a partition of the numbers $\{1, \ldots, N\}$
into two groups A and B such that:



— A and B have the same cardinality 

— sum of numbers in A = sum of numbers in B 

— sum of squares of numbers in A = sum of squares of numbers in B 

This problem admits a solution iff N is a multiple of 8 and is modeled as
follows. Each configuration consists in the partition of the values
$V_i \in \{1, \ldots, N\}$ in two subsets of equal size. There are two constraints:

$= N(N+1)/2$

$= N(N+1)(2N+1)/6$

The possible moves from one configuration consist in all possible swaps exchang- 
ing one value in the first subset with another one in the second subset. The errors 
on the equality constraints are computed as previously in the magic square prob- 
lem. In this problem again, as in the previous all-intervals example, all variables 
play the same role and there is no need to project errors on variables. The total 
cost of a configuration can be obtained as the sum of the absolute values of all 
constraint errors. Obviously again, a solution is found when the total cost is 
equal to zero. 
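
A small Python sketch of this cost and of the swap neighborhood is given below. It is an assumption-laden illustration: the function names are hypothetical, and each constraint error is measured here as the difference between the two groups (sum of A minus sum of B, and likewise for squares) rather than against a constant as in the text; both formulations vanish on the same partitions.

```python
def partition_cost(group_a, group_b):
    """Sum of the absolute errors of the sum and sum-of-squares constraints.

    Assumption: each error is the difference between the two groups.
    """
    sum_error = sum(group_a) - sum(group_b)
    square_error = sum(x * x for x in group_a) - sum(x * x for x in group_b)
    return abs(sum_error) + abs(square_error)

def best_swap(group_a, group_b):
    """Return (cost, i, j) for the best exchange of group_a[i] and group_b[j]."""
    best = None
    for i in range(len(group_a)):
        for j in range(len(group_b)):
            a, b = list(group_a), list(group_b)
            a[i], b[j] = b[j], a[i]
            candidate = (partition_cost(a, b), i, j)
            if best is None or candidate < best:
                best = candidate
    return best

solved_cost = partition_cost([1, 4, 6, 7], [2, 3, 5, 8])   # a solution for N = 8
start_cost = partition_cost([1, 2, 3, 4], [5, 6, 7, 8])    # a poor starting point
improved_cost, i, j = best_swap([1, 2, 3, 4], [5, 6, 7, 8])
```

Because swaps always exchange one value from each subset, the equal-cardinality condition holds by construction and never needs an error function of its own.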



Table 7. Number partitioning results

problem    time (sec)   time (sec)   time (sec)   nb           nb local   nb        nb
instance   avg of 10    best         worst        iterations   minima     swaps     resets
80         0.01         0.00         0.02         169          108        61        123
120        0.02         0.01         0.04         194          118        76        177
200        0.11         0.06         0.19         383          216        166       433
512        1.13         0.07         3.26         721          348        372       1918
600        1.86         0.02         8.72         870          414        456       2484
720        4.46         0.54         21.12        1464         680        784       5101
800        6.41         0.26         12.55        1717         798        919       6385
1000       8.01         2.44         17.25        1400         630        769       6306



Table 7 details the performances of this algorithm on several instances. For
a problem of size N the following settings are used: Tabu tenure is equal to 2
and 2 % of the variables are reset when one variable is Tabu. The programs
were executed on a Pentium III 733 MHz with 192 MB of memory running Linux
RedHat 7.0. Figure 4 depicts how CPU times evolve w.r.t. problem size: the
dotted line represents the best execution time, the dashed line the worst one
and the solid line the average one. Constraint Programming systems such as
GNU Prolog cannot solve this problem for instances larger than 128.




Fig. 4. Number partitioning graph 



4.5 The Alpha Cipher 

This problem was posted on the newsgroup rec.puzzles a few years ago. It
consists in solving a system of 20 simultaneous equations over the integers
as follows. The numbers {1,...,26} have to be assigned to the letters of the
alphabet. The number beside each word is the total of the values assigned
to the letters in the word, e.g. for LYRE, the letters L, Y, R, E might equal
5, 9, 20 and 13 respectively, or any other combination that adds up to 47. The
problem consists in finding the value of each letter satisfying the following
equations:

BALLET  = 45     GLEE  = 66     POLKA     = 59      SONG    = 61
CELLO   = 43     JAZZ  = 58     QUARTET   = 50      SOPRANO = 82
CONCERT = 74     LYRE  = 47     SAXOPHONE = 134     THEME   = 72
FLUTE   = 30     OBOE  = 53     SCALE     = 51      VIOLIN  = 100
FUGUE   = 50     OPERA = 65     SOLO      = 37      WALTZ   = 34

This is obviously modeled by a set of 20 linear equations on 26 variables.
The errors on the linear constraints are computed as previously in the magic
square example. The projection on a variable is the absolute value of the sum,
over all constraints, of the constraint error multiplied by the coefficient of
the variable in that (linear) constraint. The total cost is, as usual, the sum
of the absolute values of constraint errors.
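
The weighted projection can be illustrated with the following Python sketch. The names `EQUATIONS`, `constraint_errors`, `cost` and `projections` are hypothetical; the coefficient of a letter in an equation is taken to be the number of times it occurs in the word, and the starting assignment A=1, ..., Z=26 is chosen only for the example.

```python
import string

EQUATIONS = {
    "BALLET": 45, "CELLO": 43, "CONCERT": 74, "FLUTE": 30, "FUGUE": 50,
    "GLEE": 66, "JAZZ": 58, "LYRE": 47, "OBOE": 53, "OPERA": 65,
    "POLKA": 59, "QUARTET": 50, "SAXOPHONE": 134, "SCALE": 51, "SOLO": 37,
    "SONG": 61, "SOPRANO": 82, "THEME": 72, "VIOLIN": 100, "WALTZ": 34,
}

def constraint_errors(values):
    """Signed error of each equation: sum of letter values minus the target."""
    return {word: sum(values[ch] for ch in word) - total
            for word, total in EQUATIONS.items()}

def cost(values):
    """Sum of the absolute values of the constraint errors."""
    return sum(abs(e) for e in constraint_errors(values).values())

def projections(values):
    """|sum of error * coefficient| for each letter, the coefficient being
    the number of occurrences of the letter in the word."""
    errors = constraint_errors(values)
    return {letter: abs(sum(err * word.count(letter)
                            for word, err in errors.items()))
            for letter in values}

# Illustrative starting configuration: A=1, B=2, ..., Z=26.
values = {ch: i + 1 for i, ch in enumerate(string.ascii_uppercase)}
errors = constraint_errors(values)
proj = projections(values)
```

With this assignment Z appears twice in JAZZ (error 5) and once in WALTZ (error 48), so its projected error is 2 * 5 + 48 = 58, which shows how the coefficient weights the contribution of each equation.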

Local search is certainly not the best way to solve such a (linear) problem.
Nevertheless it is interesting to see the performances of adaptive search
on such a benchmark in order to observe the versatility of this method. Table 8
details the performances of this algorithm. The following settings are used: Tabu
tenure is equal to 1 and 5 % of the variables are reset when 6 variables are Tabu.
The programs were executed on a Pentium III 733 MHz with 192 MB of memory
running Linux RedHat 7.0.



Table 8. Alpha cipher result

problem    time (sec)   time (sec)   time (sec)   nb           nb local   nb        nb
instance   avg of 10    best         worst        iterations   minima     swaps     resets
alpha-26   0.08         0.03         0.18         5419         3648       1770      751



Constraint Programming systems such as GNU Prolog can solve this problem 
in 0.25 seconds with standard labeling and in 0.01 seconds with the first-fail 
labeling heuristics. Surprisingly, adaptive search is not so bad on this example, 
which is clearly out of the scope of its main application domain. 

5 Conclusion and Perspectives 

We have presented a new heuristic method called adaptive search for solving
Constraint Satisfaction Problems by local search. This method is generic,
domain-independent, and uses the structure of the problem in terms of
constraints and variables to guide the search. It can apply to a large class
of constraints (e.g. linear and non-linear arithmetic constraints, symbolic
constraints, etc.) and naturally copes with over-constrained problems.
Preliminary experiments on some classical CSPs show very encouraging results,
about one or two orders of magnitude faster than competing methods on large
benchmarks. Nevertheless, further testing is obviously needed to confirm
these results.

It is also worth noticing that the current method does not perform any
planning, as it only computes the move for the next time step out of all
possible current moves. It only performs a move if it immediately improves the
overall cost of the configuration, or it performs a random move to escape a
local minimum. A simple extension would be to allow some limited planning
capability by considering not only the immediate neighbors (i.e. nodes at
distance 1) but all configurations on paths up to some predefined distance
(e.g. all nodes at distance less than or equal to some k), and then choosing
to move to the neighbor in the direction of the most promising node, in the
spirit of variable-depth search [11] or limited discrepancy search [9]. We
plan to include such an extension in our model and evaluate its impact.
Further work is needed to assess the method, and we plan to develop a more
complete performance evaluation, in particular concerning the robustness of
the method, and to better investigate the influence of stochastic aspects and
parameter tuning. Future work will include the development of dynamic,
self-tuning algorithms.

Abstract. In the material flow of a plant, parts are grouped in batches,
each having as attributes the shape and the color. In both departments,
a changeover occurs when the attribute of a new part changes. The
problem consists in finding a common sequence of batches optimizing
an overall utility index. A metaheuristic approach is presented which
solves a set of real-life instances and performs satisfactorily on
a large sample of experimental data.

Keywords: Evolutionary algorithms, sequencing, manufacturing sys- 
tems. 



1 Introduction 

This paper deals with a problem of a serial production system concerning the 
way in which two consecutive stages should organize their internal work, taking 
into account the requirements of the other stage. 

The models considered in this paper are related to an application in a real 
industrial context, namely a plant in which a large number of different panels 
of wood are cut, painted and assembled to build furniture parts. In different 
departments of the plant, items are grouped according to different attributes. In 
the cutting department, parts are grouped according to their shape and material, 
in the subsequent painting department parts are grouped according to their 
color, and finally the assembly of the furniture is organized on the basis of the 
different products. This paper focuses on the coordination issues between the first
two stages, and therefore we are only interested in two attributes, called for 
simplicity shape and color. 

In both departments, a changeover occurs when the attribute of a new part 
changes. If a part must be cut having a different shape from the previous one, 
cutting machinery must be reconfigured. Similarly, the painting station, when 
a new color is used, must be cleaned in order to eliminate the residuals of the 
previous color. In both cases costs are incurred in terms of time and manpower. 

Limited interstage buffering between the two stages forces the sequences in
which the items are processed in the two stages to be coordinated. In this paper,
we analyze the case in which the two departments follow the same common
sequence. Each department would like to minimize its own total changeover
cost, but the objectives of the two departments may be conflicting.

K. Steinhofel (Ed.): SAGA 2001, LNCS 2264, pp. 91-105, 2001.

Other issues may exist in real-life situations. For example, changeovers may
have different costs. In the painting department, the setup operation takes less
time to switch from a lighter to a darker color than vice versa. In some cases, there
may even be precedence constraints that limit the set of feasible sequences in 
the two departments. On the other hand, there are some fixed activities in- 
volved in a changeover which are very much the same, regardless of the partic- 
ular changeover, and the contribution of these fixed activities to the setup cost 
may be extremely relevant. Hence, the total number of changeovers is usually a 
meaningful index of performance. 

In this paper, the problem in its basic version is addressed, i.e., when all the 
changeovers in each department and across departments have the same cost and 
no resequencing is possible between the two stages. Hence, the focus is on the 
number of changeovers, in order to establish a single common sequence of items 
which both stages will follow. Even so, the problem turns out to be hard and 
calls for a heuristic solution approach [1]. 

Each item to be produced is characterized by its own shape and color. All
the items having the same shape and color form a single batch. In fact, there
is no advantage in splitting such batches. In the first (second) department, a
changeover is paid when the new batch has a different shape (color) from the 
previous one. Otherwise, no changeover is incurred. Note that since we want to 
minimize the number of changeovers, the actual cardinality of each batch is of 
no interest at all. 

Each given sequence of the batches results in a number of changeovers to be 
paid by each department. In particular two different objectives are considered, 
namely: i) minimize the total number of changeovers in the two departments 
(MTC); ii) minimize the maximum number of changeovers paid by either de- 
partment (MMC). 

While the former objective corresponds to the maximization of overall utility,
the latter may better capture the need to balance the changeover costs between
the two departments. Combining the two objectives is also of interest, and
leads to a multicriteria optimization problem.



2 Problem Formulation 

The problem we consider can be formulated as follows. Let $A$ be a set of
batches to be produced. The batches must be processed by two departments of the
plant, called $D_S$ and $D_C$, in the same order. Each batch is characterized by
two attributes, say shape and color. Let $S$ and $C$ denote the sets of all
possible shapes and colors respectively. We denote the shapes as $s_i$,
$i = 1, \dots, |S|$, and the colors as $c_j$, $j = 1, \dots, |C|$. Each batch is
therefore defined by a pair $(s_i, c_j)$. If batch $(s_i, c_j)$ is processed
immediately after batch $(s_h, c_k)$, a changeover is paid in department $D_S$
if $s_h \neq s_i$, and a changeover is paid in department $D_C$ if
$c_k \neq c_j$. We can represent the input as a bipartite graph
$B = (S, C, A)$, in which nodes in $S$ correspond to shapes, nodes in $C$ to
colors and each edge of $A$ corresponds to a batch to be produced. The problem
is to sequence the batches in a profitable way from the viewpoint of the number
of changeovers. This means that we must find an ordering $\sigma$ of the edges
of $B$. If two consecutive edges $(i, j)$ and $(h, k)$ in $\sigma$ have no nodes
in common, both departments have to pay one changeover when switching from
batch $(i, j)$ to batch $(h, k)$. We refer to this as a global changeover. On
the other hand, if $i = h$ ($j = k$), only department $D_C$ ($D_S$) pays a
changeover. This is called a local changeover. For a given sequence $\sigma$,
we can therefore easily compute the number of changeovers incurred by each
department, call them $N_S(\sigma)$ and $N_C(\sigma)$ respectively. In fact,
let $\delta_{ih}$ be equal to 1 if $i \neq h$ and 0 otherwise, let
$s(\sigma(q))$ denote the shape of the $q$-th batch in the sequence $\sigma$,
and let $c(\sigma(q))$ denote its color.

Let

$$N_S(\sigma) = \sum_{q=1}^{|A|-1} \delta_{s(\sigma(q)),\, s(\sigma(q+1))}, \qquad
N_C(\sigma) = \sum_{q=1}^{|A|-1} \delta_{c(\sigma(q)),\, c(\sigma(q+1))};$$

then the problems addressed in this paper can be formulated as follows:

- MTC: Given a bipartite graph $B = (S, C, A)$, find a sequencing $\sigma$ of
the edges such that $N_S(\sigma) + N_C(\sigma)$ is minimized.

- MMC: Given a bipartite graph $B = (S, C, A)$, find a sequencing $\sigma$ of
the edges such that $\max\{N_S(\sigma), N_C(\sigma)\}$ is minimized.

Fig. 1. MTC and MMC: an example.



An example is reported in Figure 1. There are 10 batches to be produced,
requiring 4 shapes and 7 colors. Figures 1(a) and (b) show the optimal sequences
for MTC and MMC respectively (the edges are numbered according to their position
in the sequence). It is easy to see that the two objectives are not equivalent.
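
To make the two objectives concrete, the following minimal sketch counts the changeovers of a given sequence. The representation of a batch as a (shape, color) tuple and the function names are our own assumptions for illustration; the toy sequence is not the instance of Figure 1.

```python
def changeovers(sequence):
    """Return (N_S, N_C): shape and color changeovers along the sequence."""
    pairs = list(zip(sequence, sequence[1:]))
    n_s = sum(1 for (s1, _), (s2, _) in pairs if s1 != s2)
    n_c = sum(1 for (_, c1), (_, c2) in pairs if c1 != c2)
    return n_s, n_c

def global_changeovers(sequence):
    """Transitions in which both the shape and the color change."""
    return sum(1 for (s1, c1), (s2, c2) in zip(sequence, sequence[1:])
               if s1 != s2 and c1 != c2)

def mtc_value(sequence):
    """Total number of changeovers paid by the two departments."""
    return sum(changeovers(sequence))

def mmc_value(sequence):
    """Maximum number of changeovers paid by either department."""
    return max(changeovers(sequence))

# Toy sequence of four distinct batches.
seq = [("s1", "c1"), ("s1", "c2"), ("s2", "c2"), ("s3", "c3")]
```

Since batches are distinct, each transition costs one changeover if it is local and two if it is global, so `mtc_value(seq)` equals `len(seq) - 1 + global_changeovers(seq)`.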

The MTC problem consists in finding a sequence $\sigma$ of the edges of $B$
such that the total number of changeovers is minimum. Since batches are
distinct, every transition between consecutive batches costs one changeover if
it is local and two if it is global; hence MTC is equivalent to minimizing the
number of global changeovers. An ideal solution $\sigma_I$ is one
in which all changeovers are local, i.e., batches are sequenced so that every
two consecutive batches have either the same shape or the same color. An ideal
solution may not exist, but if it does, it can be characterized in graph
theoretical terms.

An ideal sequence on $B = (S, C, A)$ exists if and only if there is a dominating
trail on $B$ (i.e. a trail such that every edge of the graph has at least one
endpoint on it). Note that a dominating trail completely specifies an ideal
schedule, since dominated edges can always be scheduled between two consecutive
edges of the trail with no global changeovers. However, if an ideal sequence
does not exist, the number of global changeovers must still be minimized. We can
therefore express MTC as the problem of dominating the edges of $B$ with the
minimum number of (edge-disjoint) trails, i.e., finding a minimum cardinality
dominating trail set (MDTS) [3]. The concept of dominating trail has received
considerable attention in the graph theoretic literature, and MDTS has been
shown to be polynomially solvable when $B$ is a tree or a cactus [2,7], while
MDTS, MTC and MMC are NP-hard even for bipartite graphs [1,4].

Some formulations for MTC and MMC can be considered. Notice that, given
the bipartite graph $B = (S, C, A)$, MTC can always be solved by finding the
hamiltonian path of minimum weight on a complete graph $K_{|A|}$ (with one node
for each edge of $B$), in which the weight associated with each edge $(x, y)$ of
$K_{|A|}$ is 0 if $x \in A$ and $y \in A$ are adjacent in $B$ and 1 otherwise.
Clearly, there is a one-to-one correspondence between hamiltonian paths in
$K_{|A|}$ and batch sequences, the weight of a hamiltonian path being equal to
the number of global changeovers in the corresponding sequence.
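
The correspondence can be illustrated with a small sketch that enumerates all hamiltonian paths of the complete graph by brute force. This is only meant to illustrate the 0-1 weights on tiny instances (the enumeration is factorial in the number of batches); the function names and the tuple representation of batches are assumptions of ours.

```python
from itertools import permutations

def weight(x, y):
    """0 if batches x and y share a shape or a color (adjacent edges of B),
    1 otherwise (the transition is a global changeover)."""
    return 0 if x[0] == y[0] or x[1] == y[1] else 1

def min_weight_hamiltonian_path(batches):
    """Exhaustively search all batch orderings; usable only for tiny inputs."""
    best_seq, best_w = None, None
    for perm in permutations(batches):
        w = sum(weight(a, b) for a, b in zip(perm, perm[1:]))
        if best_w is None or w < best_w:
            best_seq, best_w = list(perm), w
    return best_seq, best_w

# Toy instance: batch (3, "c") shares no attribute with the others,
# so at least one global changeover is unavoidable.
batches = [(1, "a"), (2, "b"), (1, "b"), (3, "c")]
best_seq, best_w = min_weight_hamiltonian_path(batches)
```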

However, the size of this graph can easily be too large to find the minimum
weight hamiltonian path within the time constraints practitioners are usually
faced with. Let the density of $B$ be defined as $|A| / (|S||C|)$. A problem
with 50 shapes and 50 colors and density 0.4 results in a graph $K_{|A|}$ with
1000 nodes. Moreover, it is well known that instances of the hamiltonian path
problem (as well as the traveling salesman problem) with 0-1 costs are
particularly hard.

Finally, in real industrial applications, it may be important to have a method
which is time-limited, in particular if the sequencing problem has dynamic
aspects. In fact, in the case of large instances, by the time the problem is
solved, it may have changed (e.g. new orders may have arrived), so a method that
takes too long may be inappropriate whatever the quality of the solutions it
produces for a static problem. For these reasons we do not consider ILP
formulations based on TSP further in this paper.

An alternative (0-1) integer programming formulation of MTC and MMC
may be based on the approach suggested by Crama et al. [6] for the tooling and
loading problem of an FMS. Note that the size of these ILP problems is smaller
than that obtained in the hamiltonian path or TSP formulations. However,
the linear programming relaxation provides lower bounds which are extremely
weak. Hence, improvements are needed in the formulations in order to make them
effective. These improvements may consist in adding some valid constraints to
the formulations [15] and may be the object of further research.

3 Heuristic Approaches for MDTS and MTC

In this section two heuristic approaches are presented for MDTS and MTC 
problems. The first heuristic (DOMWALKS) performs better on graphs with 
high density, while the second heuristic (DOMTREE) obtains good results on 
sparse and tree-like instances. 

3.1 DOMWALKS Heuristic 

Agnetis et al. [2] proposed a heuristic (called DOMWALKS) with the aim to
build a dominating trail set $\Sigma$, trying to keep its cardinality as low as
possible. The approach is based on the following considerations. If graph $G$
happens to have a eulerian walk, the solution to MTC is straightforward, since a
eulerian walk is a dominating walk, and hence an ideal solution exists. If a
eulerian walk does not exist, the idea is to remove from $G$ some edges so that,
in the resulting graph $G'$, the degree of each node is even. If $G'$ is
connected, it has a eulerian cycle, and hence a eulerian walk. A eulerian walk
$W_e$ in $G'$ is a dominating walk in $G$, since each removed edge is adjacent
to some edge of $W_e$. If $G'$ is not connected, each component either has a
eulerian cycle or consists of a single node. All the edges of each component can
be dominated by a eulerian walk, but a new trail may be needed when moving from
one component to another.
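
The condition used above (all degrees even and the graph connected) can be checked with a few lines of code. The sketch below is our own illustration and not part of DOMWALKS as published; it assumes edges are given as pairs of node labels, with shape and color labels distinct, and ignores isolated nodes.

```python
from collections import defaultdict

def has_eulerian_cycle(edges):
    """True if every node has even degree and all edges lie in a single
    connected component (isolated nodes are ignored)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj:
        return False
    if any(len(neigh) % 2 for neigh in adj.values()):
        return False
    # Depth-first search from an arbitrary node to test connectivity.
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

# A 4-cycle s1-c1-s2-c2-s1 versus a simple path s1-c1-s2.
cycle = [("s1", "c1"), ("s2", "c1"), ("s2", "c2"), ("s1", "c2")]
path = [("s1", "c1"), ("s2", "c1")]
```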

One key feature of this approach is the edge removal phase. Given the graph
$G$, an odd-vertex pairing (OVP) $\Pi$ is a set of edge-disjoint (but possibly
not node-disjoint) paths having nodes with odd degree in $G$ as endpoints, such
that each node with odd degree in $G$ is the endpoint of exactly one path, and
no nodes of even degree are endpoints of any path. If we remove $\Pi$ from $G$,
we are left with a graph $G'$ in which all vertices have even degree. Hence, if
we find an OVP with few edges, chances are that $G'$ is still connected. We
therefore seek an OVP with a minimum total number of edges. Such a pairing can
be found in polynomial time by solving a matching problem [2], as follows. Given
the graph $G$, let $N(G)$ be the set of nodes of $G$, and let $N_{odd}(G)$ be
the set of nodes having odd degree in $G$. Consider a complete graph
$K_{|N_{odd}(G)|}$ with $|N_{odd}(G)|$ nodes, in which each edge $(i, j)$ is
weighted by the length (number of edges) of the shortest path from $i$ to $j$ in
$G$. A perfect matching of minimum weight in $K_{|N_{odd}(G)|}$ corresponds to
an OVP with a minimum number of edges in $G$. Note that the optimality of such
an OVP implies that the paths are all edge-disjoint.
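
The matching step can be sketched as follows. For brevity the sketch replaces the polynomial-time matching algorithm with an exhaustive search over pairings, so it is suitable only for a handful of odd nodes; it returns the pairs and the total path length, not the paths themselves. All names are hypothetical, and node labels are assumed to be sortable.

```python
from collections import defaultdict, deque

def bfs_distances(adj, src):
    """Shortest path lengths (in edges) from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def min_weight_pairing(nodes, dist):
    """Exhaustive minimum-weight perfect matching on an even-sized list."""
    if not nodes:
        return 0, []
    first, rest = nodes[0], nodes[1:]
    best = (float("inf"), [])
    for i, partner in enumerate(rest):
        if partner not in dist[first]:
            continue  # partner lies in a different component
        sub_w, sub_pairs = min_weight_pairing(rest[:i] + rest[i + 1:], dist)
        w = dist[first][partner] + sub_w
        if w < best[0]:
            best = (w, [(first, partner)] + sub_pairs)
    return best

def odd_vertex_pairing(edges):
    """Pair up the odd-degree nodes so that the total distance is minimum."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    odd = sorted(n for n in adj if len(adj[n]) % 2 == 1)
    dist = {n: bfs_distances(adj, n) for n in odd}
    return min_weight_pairing(odd, dist)

# Star with center c1 and leaves s1, s2, s3: all four nodes have odd degree.
star = [("s1", "c1"), ("s2", "c1"), ("s3", "c1")]
total, pairs = odd_vertex_pairing(star)
```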

The algorithm begins by finding an OVP $\Pi$. Once the edges of $\Pi$ are
removed, we are left with the graph $G'$. The algorithm then starts from an
empty trail set, and iteratively adds trails one by one until all the edges of
$G$ are dominated. We call $\Sigma_c$ the current set of trails formed by the
algorithm. The algorithm builds each trail by finding a dominating trail for one
or more connected components of $G'$. In order to limit the number of trails
used as much as possible, the algorithm tries to use the edges of the OVP $\Pi$
to move from the current component to another one. Of course, no edge of $\Pi$
can be used more than once throughout the algorithm, so it may be the case that
no path of $\Pi$ is available any longer to move from the current component to
another. In this case a trail is completed and added to the current trail set
$\Sigma_c$, the algorithm jumps to a new component, and a new trail is started.
This algorithmic scheme allows building
good heuristics and yields a simple bound on the number of global changeovers
in an optimal solution, therefore: \MMC\ < \MTC\ < -g \E{G)\. 

When the density of the graph is higher than 25%, DOMWALKS often finds
a dominating walk, if it exists, and it requires little computational time even
for large instances. For a given value of $|N(G)|$, the sparse instances are
those for which DOMWALKS gives the largest number of global changeovers. This is
not surprising, since as the number of edges increases, it is easier to
establish a dominating walk.

3.2 DOMTREE Heuristic 

An $O(|V(G)|)$ algorithm (known in the literature as DOMTREE) has been proposed
for MDTS on trees [2]. Therefore, a heuristic for a more general graph $G$ can
be obtained by applying DOMTREE to a spanning tree of $G$. As spanning trees
can be found efficiently [11], the overall algorithm for MDTS based on
DOMTREE is efficient as well.

In particular, DOMTREE is based on the relationship between MDTS and
the Hamiltonian Completion Number problem (HCN). The Hamiltonian Completion
Number of a graph $G = (V, E)$ is the minimum number of edges which
need to be added to $G$ to make it hamiltonian, and will be denoted as
$HCN(G)$. Given a graph $G$, the hamiltonian completion number of its line graph
$L(G)$, i.e. $HCN(L(G))$, is called the Edge Hamiltonian Completion number of
$G$. The line graph $L(G)$ of a graph $G$ has a hamiltonian cycle if and only if
$G$ has a dominating trail. Moreover, there is a relationship between the
cardinality of a minimum dominating trail set of $G$ and $HCN(L(G))$. Given any
connected graph $G = (V, E)$, the cardinality $|MDTS|$ of a minimum dominating
trail set on $G$ is $k$ if and only if $HCN(L(G)) = k - 1$. The DOMTREE
algorithm was proposed for $HCN(L(G))$ when $G$ is a tree, and it works also for
MDTS on trees [2]. This approach yields good experimental results on sparse and
almost tree-like instances.

4 An Evolutionary Algorithm for MTC and MMC 

Evolutionary algorithms (EA) are optimization techniques inspired by natural
processes [8]. They handle a population of individuals that evolve with the
help of information exchange procedures. Each individual may also evolve
independently. This behavior alternates periods of cooperation with periods of
self-adaptation. The population of an EA iteratively evolves according to some
specific rules. The proposed algorithmic approach has an evolution process based
on a genetic algorithm. The concepts of genetic algorithms (GAs) have been
applied successfully to problems involving pattern recognition, classification
and other problems in the domain of Artificial Intelligence. Recent literature
reports applications to large combinatorial [13,5] and industrial scheduling
problems [14,10]. This section attempts to apply EAs to two
such problems, namely MTC and MMC. GAs, even in their basic version, have
many possible choices of control parameters, but the emphasis here is on
showing the feasibility of using GAs in an evolutionary strategy for such
problems by producing a working algorithm, rather than on a search for the
optimal parameter settings. This approach provides a way to deal with these
problems, although its performance may depend on the strategies adopted and on
the particular settings of the algorithm's parameters.

4.1 The Genetic Evolutionary Mechanism 

A part of the research community is involved in the design and development of
heuristics for NP-hard problems. One such problem, known to be NP-hard, is the
permutation flowshop sequencing problem in which n jobs have to be processed
(in the same order) on m machines. The objective is to find the permutation of
jobs which minimizes the makespan, i.e. the time at which the last job is
completed on machine m. For this problem Reeves [14] proposes a GA solution
scheme that performs relatively well on large instances. In this paper, starting
from the Reeves scheme, a GA approach dedicated to the permutation sequencing
problems MTC and MMC is developed. In particular, these problems may be
considered as a special case of permutation scheduling on a single processor in
which the edges of the graph are the jobs to be processed, and the objective is
to minimize the number of changeovers. Different genetic operators can be
identified, the most commonly used ones being crossover (an exchange of sections
of the parents' chromosomes), and mutation (a random modification of the
chromosome). In the context of finding the optimal solution to a large
combinatorial problem, a GA works by maintaining a population of M solutions
which are potential parents, whose fitness values have been calculated. In
Holland's original GA [9], one parent is selected on a fitness basis (i.e. the
better the fitness value, the higher the probability of it being chosen), while
if crossover is to be applied, the other parent is chosen randomly. They are
then mated, producing offspring that replace a part of the existing population
chosen by means of a probabilistic rule. The reproduction is iteratively
repeated. Various versions of this procedure have been proposed in the
literature and different strategies have been adopted.

Fig. 2. The scheme of the basic genetic algorithm.

5 The Implementation 

Other than methodological or strategic decisions, there are parametric choices
to make: different crossover and mutation probabilities, different population
sizes and so on. It is also quite possible that the way the initial population
is chosen will have a significant impact on the results. Figure 2 depicts the
traditional scheme of genetic algorithms. In this section we focus on the GA
application in the evolutionary scheme for the specific problems introduced in
Sec. 1 and 2.

- Chromosomes and crossover operation - In order to apply any GA
to a sequencing problem, there is an obvious practical difficulty to consider.
In most traditional GAs, the chromosomal representation is by means of a string
of 0s and 1s, and the result of a genetic operator is still a valid chromosome.
This is not the case if one uses a permutation to represent the solution of a
problem, which is the natural representation for a sequencing problem. For
example, consider the parents shown below and a one-point crossover:

Parent 1: (2, 1, 3, 4, 5, 6, 7) Offspring 1: (2, 1, 3, 2, 5, 7, 6)

Parent 2: (4, 3, 1, 2, 5, 7, 6) Offspring 2: (4, 3, 1, 4, 5, 6, 7)

It is clear that, choosing the crossover point between the third and the fourth
symbol, the offspring are illegitimate because some symbols do not appear
exactly once in the sequence. Different representations are possible, but it
seemed easier to modify the idea of crossover and keep the permutation
representation. The crossover (C) chosen for the proposed algorithm consists of
randomly selecting one point X along the parents' chromosomes, taking the pre-X
section of the first parent, and filling up the chromosome by taking in order
each legitimate symbol from the second parent. The application of this rule to
the previous example yields:

Parent 1: (2, 1, 3, 4, 5, 6, 7) Offspring 1: (2, 1, 3, 4, 5, 7, 6)

Parent 2: (4, 3, 1, 2, 5, 7, 6) Offspring 2: (4, 3, 1, 2, 5, 6, 7)
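
The crossover rule can be written down in a few lines. The sketch below is our own rendering of the rule as described in the text (the function name and the explicit cut-point argument are assumptions); it reproduces the two offspring of the example when the cut point is placed after the third symbol.

```python
def crossover_c(first_parent, second_parent, cut):
    """Keep the first `cut` symbols of the first parent, then append the
    remaining symbols in the order in which they occur in the second parent."""
    head = list(first_parent[:cut])
    used = set(head)
    tail = [gene for gene in second_parent if gene not in used]
    return head + tail

parent1 = [2, 1, 3, 4, 5, 6, 7]
parent2 = [4, 3, 1, 2, 5, 7, 6]
offspring1 = crossover_c(parent1, parent2, 3)
offspring2 = crossover_c(parent2, parent1, 3)
```

By construction the result is always a valid permutation, whatever the cut point.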

The idea behind C is that it preserves the absolute positions of the symbols
taken from the first parent, and the relative positions of those from the
second. This provides enough scope for modification of the chromosome without
excessively disrupting it. Preliminary experiments showed that this crossover
rule tended to converge rather rapidly to a population of almost identical
chromosomes (premature convergence), so a mutation operation is also required.
In other words the crossover C emphasizes exploitation instead of exploration;
using a mutation provides a way to restore the balance. In the proposed
evolutionary algorithm, a chromosome directly represents a complete sequence of
the edges of the graph G related to an instance of the problem (as reported in
Figure 3). In particular, each gene represents a single edge of the graph and
has the same label. The gene also contains information about the vertices and
the weight of the represented edge. The fitness of each chromosome is obtained
straightforwardly by scanning the sequence and calculating the number of
changeovers. These characteristics make this representation preferable to those
based on a Random Key Encoding (RKE). Moreover, experiments underline three main
drawbacks of RKE: a) the sorting process needed to obtain the sequence is time
consuming; b) the encoding does not preserve the adjacency of edges in a given
sequence; c) the crossover operation may lose more important information.

Fig. 3. The representation: A) the instance; B) a chromosome and its fitness.



- Reproduction mechanism - The selection of parents for the reproduction
process (by means of the crossover operator) uses a simple ranking mechanism
for the first parent, while the second parent is selected randomly from the
current population. To this end, the population is ordered on the basis of the
fitness of its elements. However, the ranking selection mechanism involves only
the $K$ elements of the population with a better fitness than the current
average fitness. The mechanism works in accordance with the probability
distribution $p([i]) = \frac{2i}{K(K+1)}$, where $[i]$ is the $i$-th chromosome
in ascending order of fitness (for example, descending order of MTC or MMC), and
$K$ is the number of elements involved in this selection. This implies that the
median value has a probability of $1/K$ of being selected, while the $K$-th (the
fittest element) has a probability of $2/(K+1)$, about twice that of the median.
The ranking mechanism is also used for choosing the elements to be eliminated at
each iteration, i.e. replaced by an offspring. Since we already have a ranking
of the chromosomes, it is easy to choose among the elements which have a bad
fitness.
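
A linear ranking distribution of this kind, with selection probability proportional to the rank, can be sampled as in the following sketch. The function names are ours, and the candidates are assumed to be already sorted from worst to best.

```python
import random

def rank_probabilities(k):
    """Selection probability 2i / (k (k + 1)) for ranks i = 1, ..., k."""
    return [2 * i / (k * (k + 1)) for i in range(1, k + 1)]

def select_by_rank(ranked, rng=random):
    """Pick one element of `ranked` (sorted from worst to best) with a
    probability proportional to its rank."""
    weights = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=weights, k=1)[0]

probs = rank_probabilities(5)
chosen = select_by_rank(["e", "d", "c", "b", "a"], random.Random(0))
```

For k = 5 the median rank receives probability 1/5 and the best rank 2/6, about twice as much.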

- Mutation operation - A mutation operator was used to reduce the rate
of convergence. In traditional GAs, mutation of a chromosome is performed by
flipping each of its symbols from 0 to 1 (or vice versa) with a small
probability. Using a permutation representation, mutation needs to be defined
differently. Three types of mutation are considered: a) exchange mutation, i.e.
a simple exchange of two symbols in a permutation, chosen at random; b) shift
mutation, i.e. a shift of one symbol (again, chosen randomly) a random number of
places to the right or to the left; c) reversal mutation, which replaces a
subsequence of the selected chromosome with its reverse. In the latter case, if
the subsequence is chosen between two global changeovers, the fitness value of
the chromosome is not affected by the mutation operator. The premature
convergence problem has been noted by several workers in applying GAs, and
various proposals have been made to prevent it. The method applied here is to
use an adaptive mutation rate $P_m$. At the start of the algorithm a high
probability of mutation is allowed; this probability is slowly decreased as long
as a reasonably diverse population exists. If the diversity becomes too small,
the original high value is restored. Several measures of diversity may be
adopted; our choice is to reset the mutation rate when a diversity ratio,
normalized by the average fitness $f_{avg}$, falls below a threshold $D_{th}$,
where $f$ is the fitness value of each chromosome (that is, directly the
objective of the problem to solve).
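
The three mutation operators and the adaptive rate can be sketched as follows. This is an illustration under our own assumptions: the function names, the multiplicative `decay` factor, and the way the diversity value is passed in are hypothetical and not taken from the implementation described here.

```python
import random

def exchange_mutation(perm, rng=random):
    """Swap two randomly chosen positions."""
    child = list(perm)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def shift_mutation(perm, rng=random):
    """Remove one randomly chosen symbol and reinsert it at a random place
    (possibly the original one)."""
    child = list(perm)
    gene = child.pop(rng.randrange(len(child)))
    child.insert(rng.randrange(len(child) + 1), gene)
    return child

def reversal_mutation(perm, rng=random):
    """Replace a randomly chosen subsequence with its reverse."""
    child = list(perm)
    i, j = sorted(rng.sample(range(len(child) + 1), 2))
    child[i:j] = reversed(child[i:j])
    return child

def update_mutation_rate(p_m, p_init, decay, diversity, d_th):
    """Restore the initial rate when diversity drops below the threshold,
    otherwise let the rate decrease slowly (decay is a hypothetical factor)."""
    return p_init if diversity < d_th else p_m * decay

rng = random.Random(1)
base = [1, 2, 3, 4, 5, 6, 7]
mutants = [op(base, rng) for op in
           (exchange_mutation, shift_mutation, reversal_mutation)]
```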

- Initial population - Another consideration deals with the selection of the
initial population. The basic GA scheme assumes that the initial population is
chosen completely at random. It may be profitable to try the effect of seeding
the initial population with a good solution generated by a constructive
heuristic. In the case of the problems under study, MTC may be tackled with
heuristics based on DOMWALKS [1] or DOMTREE [2], while for MMC we do not have a
specific heuristic, and we may use our best MTC solution as an upper bound
for this problem. This suggests an algorithmic scheme consisting of two
subsequent stages, one dedicated to MTC and another to MMC. Experimental
results show that the seeding operation considerably reduces the computational
time (see also Figure 4).

- Stopping criteria - The basic GA scheme does not have a natural stopping
point. In order to avoid excessive computation time it is useful to introduce
some stopping criteria. The simplest method consists in limiting the number of
algorithm iterations or the total computational time. This method may be refined
by also considering the evolution of the algorithm and starting iteration/time
counters only when an iteration does not improve the population fitness (e.g.
the best or average values). The stopping criteria introduced in the proposed
evolutionary scheme take into account some theoretical lower and upper bounds on
the objective function [12]. The first avoids continuing the search when an
optimal sequence has already been found, while the second may force the
algorithm to continue the search in order to improve the current solution. Both
bounds allow evaluating the quality of the current individuals in the
population.
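
One possible way to combine these criteria is sketched below. The exact rule used in the implementation is not given in the text, so the logic here (in particular the way the upper bound overrides the stall counter) is our reading, and all parameter names are hypothetical.

```python
def should_stop(best_value, lower_bound, upper_bound, stall_iters, max_stall):
    """Decide whether the search may stop (minimization problem).

    - best_value <= lower_bound: the solution is provably optimal, stop.
    - best_value > upper_bound: a better solution is known to exist,
      so keep searching regardless of the stall counter.
    - otherwise stop after max_stall iterations without improvement.
    """
    if best_value <= lower_bound:
        return True
    if best_value > upper_bound:
        return False
    return stall_iters >= max_stall
```

For instance, with a lower bound of 10 and an upper bound of 14, a solution of value 16 keeps the search alive even after many non-improving iterations, while a solution of value 10 stops it immediately.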

- Evolutionary strategy - Some parts of the algorithm have been modified
in order to improve its performance. The changes are valuable especially on
large instances. A particular evolution scheme is implemented in which two GAs
evolve in parallel (starting with different initial populations) in the first
phase. This phase ends when the average fitness of each population converges
between two theoretical bounds. In the second phase the populations are merged
and a new GA may start. This scheme may be useful also in the case of the
bi-criteria problem, to find solutions when both objectives (MMC and MTC) are to
be minimized. In addition to the set of randomly generated individuals, two
seeding heuristics (i.e. DOMWALKS and DOMTREE) are used. The resulting
scheme of the proposed EA is reported in Figure 5. Figure 4 describes
the typical behaviour of the basic EA (without seeding): it appears clear that
the seeding operation (i.e. using heuristics DOMWALKS or DOMTREE) may yield
a considerable time saving. The characteristics of the reversal mutation suggest
a simple way to improve also the crossover operator: when the chromosomes to
mate are almost identical (i.e., when the number of identical elements is
greater than a fixed value $S_{th}$), one of them is reversed before the
crossover. After a preliminary phase for testing and tuning the algorithm, a
particular setting of parameters is adopted and another improvement is
introduced with respect to the basic GA scheme.

Fig. 4. Typical behaviour of the EA.

To improve the convergence, a local search (LS) procedure is incorporated 
into the algorithm. At any generation, LS is applied to the elitary chromosomes 
(i.e., chromosomes with good fitness values in each population), except those 
to which LS has already been applied in previous generations. In order to 
eliminate a global changeover, the LS procedure scans a chromosome to find two 
subsequences $\sigma_1$ and $\sigma_2$ characterized as follows: a) both $\sigma_1$ and $\sigma_2$ are exactly 
delimited by two global changeovers; b) a starting (ending) subsequence $\alpha_i$ ($\beta_i$) 
exists in $\sigma_i$ in which one of the two nodes of the edges does not change; 
c) the subsequences $\alpha_1$, $\beta_1$, $\alpha_2$ and $\beta_2$ contain edges that allow the subsequences 
$\sigma_1$ and $\sigma_2$ to be linked up. After the generation of offspring, the population is updated and LS 
is applied to the new elitary elements. When the best (current) solution changes 
in one population, it is introduced into the other population as soon as it is 
updated. This exploits good genetic information and propagates it to 
each population. 
Fig. 5. The scheme of the evolutionary algorithm. 



6 Experimental Results 

A set of real instances as well as a wide set of artificial instances are considered 
in order to study the performance of the proposed approach. The real cases are 
characterized by several cutting classes (from 140 to 235) and color classes (from 
20 to 30) with a density from 18 to 60%. The production of the plant follows 
an assembly-to-order strategy. Every week a new production plan is computed 
on the basis of the actual orders. The tests refer to 32 instances related to 32 
different production weeks. 

The EA was coded in C and tested on a Pentium II PC, 200 
MHz, with 64 MB of RAM. 

In all 32 instances the optimal solution (MTC) has been found by the EA. It 
always takes less than 150 seconds, with an average computing time of 93 seconds. 
Using the seeding operation, these computing times decrease by a factor of about 
5. Moreover, the use of the heuristics reduces the random behaviour of 
the EA (Figures 6 and 7 report the best/worst results over 5 different runs). 
Figure 8 reports the best/worst performance of the EA using the seeding 
operation with respect to the MMC objective. In this case the effects are less 
appreciable, as both heuristics pursue the MTC minimization. The results 
on artificial instances provide a way to carry out the tuning of the parameters 
of the algorithm. Moreover, these experiments point out the complementary 
behaviour of the EA and the heuristics (DOMWALKS and DOMTREE). In fact, 
the EA performs well when the graph has low density or a small size in terms of 
number of edges. DOMWALKS gives good results when the graph's density 
is high and may fail when the graph is particularly sparse, while DOMTREE 
performs well on cases with low density (or tree-like graphs) independently of 
the number of edges. 




Fig. 6. Experimental results 1: EA with MTC objective and random initialization. 



7 Conclusions 

Experiments were carried out on a wide set of instances, including real industrial cases. 
The results allow us to point out the effect of the seeding and local search operations 
with respect to both the quality of the solutions and the computation time. 

The proposed EA appears to be a fast and powerful algorithm for the considered 
problems. The results obtained suggest some ways to adapt the evolutionary 
scheme in order to tackle other issues, such as: i) the case in which setups have 
different costs, as occurs when, for instance, it takes less time to switch 
from a lighter to a darker color than vice versa; ii) multicriteria problems 
in which the objective accounts for both the total and the maximum number 
of setups; iii) the introduction of technological precedence constraints on either 
shapes or colors. 

Fig. 7. Experimental results 2: EA with MTC objective and seeding operation. 

Fig. 8. Experimental results 3: EA with MMC objective and seeding operation. 
Abstract. We study the combinatorial optimization task of choosing 
the smoothest map from a given family of maps, which is motivated 
by motor control unit calibration. The problem is of particular 
interest because of its characteristics: it is NP-hard, it has a direct 
and important industrial application, it is easy to state, and it shares 
some properties of the well-known Ising spin glass model. Moreover, it 
is appropriate for the application of randomized algorithms: for local 
search heuristics because of its strong 2-dimensional local structure, and 
for Genetic Algorithms since there is a very natural and direct encoding 
which results in a variable alphabet. We present the problem from 
two points of view, an abstract view with a very simple definition of 
smoothness and the real-world application. We run local search, Genetic 
and Memetic Algorithms. We compare the direct encoding with unary 
and binary codings, and we try a 2-dimensional encoding. For a simple 
smoothness criterion, the Memetic Algorithm clearly performs best. 
However, if the smoothness criterion is more complex, the local search 
needs many function evaluations. Therefore we prefer the pure Genetic 
Algorithm for the application. 

Keywords: Genetic Algorithm, NP-hard, Control Unit Calibration, 
Variable Alphabet Coding, Hybrid GA, Smooth Maps, Combinatorial 
Optimization. 



1 Introduction 

This paper deals with a combinatorial optimization problem that is motivated 
by the calibration of electronic control units (ECUs) for combustion engines. 
We briefly sketch the situation. During engine operation, each engine parameter¹ 
is controlled by a map that is stored in the ECU as a lookup table. In this 
way, the engine is adjusted to the actual operating situation, which is necessary 
to obtain optimal fuel consumption, exhaust emissions, etc. Technically, an 
operating situation is a combination of engine speed and relative air mass flow and 
thus corresponds to a point in a rectangular domain, the operating range. Moreover, 
the maps in the ECU are defined by a grid on the operating range (the 
operating points) and the respective co-domain values. If there are m engine 
parameters, then the control unit must store m maps. For operating 
points that are not grid points, the ECU will interpolate the parameter 
values (see e.g. [1] for the technical background). 

* BMW Group, 80788 München, Germany. 

¹ Engine parameters are controlled variables that determine the behaviour of the 
engine, e.g. the ignition timing angle. 

Fig. 1. The context: grid, candidates, and a sample map 

Our problem now arises in the final phase of the lookup table design. For 
each operating point $x_j$, we are given a set of candidates, 
each of which defines a combination of values for the engine parameters. These 
candidates have been obtained by a previous optimization process separately 
for each operating point, and they have approximately the same quality (fuel 
consumption etc.). In order to populate the lookup table, one candidate has to 
be selected at each operating point. The constraint is that some of the 
parameters are adjusted mechanically and thus have a certain inertia. Hence, 
one is interested in smooth maps for these parameters, so that the value 
can be adjusted swiftly when the operating point changes. 

Note that this is a multiple objective optimization task unless m = 1, since 
we have m maps to smooth and choosing a candidate means fixing all parameter 
values at the corresponding operating point. Thus, the problem is not separable, 
i.e. it is not possible to optimize each map separately. 

Clearly the meaning of "smoothness" has to be defined precisely. The first 
part of this paper will assume a very simple smoothness criterion. This turns the 
problem into an easy-to-state combinatorial optimization problem with several 
interesting properties. The second part of this paper deals with the application. 

We use local search heuristics and Genetic Algorithms to find smooth maps. 
Even with a simple smoothness definition, it can be seen that a pure heuristic 
is not powerful enough to perform an efficient global optimization. On the other 
hand, the application implies more complex smoothness criteria. Here, the local 
search becomes less efficient, and other methods that require deeper insight into 
the problem structure are very expensive to develop or even intractable. However, 
we will see that a Genetic Algorithm can still produce good solutions. 
2 The Abstract Case 

In this section, we will consider an abstract version of our problem and a very 
simple smoothness definition. 

Let $G = \{x_j\}_{j=1}^{n}$ be a set of points defined by a rectangular $P$ by $Q$ grid 
in $\mathbb{R}^2$. For each grid point $x_j$, a number of candidates 
$\{y_j^{(k)}\}_{k=1}^{n_j} \subset \mathbb{R}$ is given (see Fig. 1). Then 
$M(x_j) = y_j^{(k_j)}$ defines a map from $G$ into $\mathbb{R}$; this is the map 
obtained by choosing the candidate $y_j^{(k_j)}$ at the grid point $x_j$, with 
$1 \le k_j \le n_j$ (Fig. 1 shows an example map). Clearly, there are 
$\prod_{j=1}^{n} n_j$ different maps; we denote the family of all these maps by $\mathcal{M}$. 

This situation is quite similar to the well-known spin glass model in two 
dimensions. Therefore, we will use a similar terminology and refer to the energy 
of a map instead of its smoothness or steepness. The problem of minimizing 
the energy of a map then corresponds to the problem of finding an exact ground state 
of a spin glass model (see e.g. [2]). Of course, there are also relations to other 
fields and problems with a two-dimensional structure, e.g. image processing. 

Definition 1. For two neighbouring grid points $x_i, x_j \in G$ and a map $M$, we 
define the energy of the connection between $x_i$ and $x_j$ in the map as 

$$E_{ij}(M) = \begin{cases} 1 & \text{if } |y_i^{(k_i)} - y_j^{(k_j)}| > c, \\ 0 & \text{otherwise,} \end{cases}$$

i.e. the connection energy is 1 if the ordinate distance is greater than a constant 
threshold $c > 0$, and 0 otherwise. We may fix $c$ without loss of generality. 

Two grid points are neighbouring if they are adjacent in the grid. 

Definition 2. For a map $M$, we define its energy (i.e. steepness) by the sum 
over all connection energies: 

$$E(M) = \sum_{i<j \text{ neighbours}} E_{ij}(M).$$

The above definition has a natural interpretation in terms of inert engine 
parameters: if the ordinate distance at two neighbouring grid points $x_i$ and $x_j$ is too 
large, then the adjusting time for the engine parameter is larger than a certain 
limit when the operating point changes from $x_i$ to $x_j$. The limit could be derived, 
e.g., from the time resolution of the control unit. 

Since the smoothest map has the lowest energy, our problem can be stated 
as follows. 

Problem ("SmoothMap"): Find the map $M_0 \in \mathcal{M}$ with minimal energy, i.e. 

$$E(M_0) = \min_{M \in \mathcal{M}} E(M).$$
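
The energy of Definitions 1 and 2 can be evaluated directly once a candidate has been chosen at every grid point. The sketch below is illustrative only: it assumes that "adjacent in the grid" means the four horizontal and vertical neighbours, and the function names are hypothetical.

```python
def connection_energy(y_a, y_b, c):
    """1 if the ordinate distance exceeds the threshold c, else 0."""
    return 1 if abs(y_a - y_b) > c else 0


def map_energy(values, c):
    """Energy of a map given as a P x Q nested list of chosen candidate values.

    Each pair of horizontally or vertically adjacent grid points is
    counted exactly once (assumed 4-neighbourhood).
    """
    P, Q = len(values), len(values[0])
    energy = 0
    for p in range(P):
        for q in range(Q):
            if p + 1 < P:
                energy += connection_energy(values[p][q], values[p + 1][q], c)
            if q + 1 < Q:
                energy += connection_energy(values[p][q], values[p][q + 1], c)
    return energy
```

For a P by Q grid this visits P(Q-1) + Q(P-1) connections, so one full evaluation is linear in the number of grid points.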

Although the above energy definition is very simple, and we have restricted 
to only one optimization objective (m = 1), the resulting abstract problem 
SmoothMap has very interesting features: 
1. It is NP-complete in the grid dimension, even if the number of candidates $n_j$ 
is limited to 2 for all $1 \le j \le n$. This is shown in [3]. 

2. It admits a PTAS (polynomial time approximation scheme) for certain 
"non-degenerate" instances. (Instances with a very small energy of the optimal 
solution are called degenerate, since in this case the approximation quality 
of the PTAS cannot be assured.) However, the PTAS does not produce good 
solutions in reasonable time; on the other hand, there is no fully polynomial 
time approximation scheme. 

3. If one grid dimension is equal to 1 (the one-dimensional case), there is a 
polynomial time algorithm. This can be extended to the case where one grid 
dimension is bounded. See [3] for these results. 

4. Because of the simple neighbourhood definition, the problem has a strong 
2-dimensional local structure. 

5. A very appropriate encoding for Genetic Algorithms is the direct encoding, 
which leads to a variable alphabet. This will be discussed in the sequel. 
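
To illustrate why the one-dimensional case mentioned in item 3 is tractable, the following sketch solves it by dynamic programming along the line. It is a straightforward Viterbi-style recursion written for this illustration and is not claimed to be the algorithm of [3]; the name `smooth_line` is hypothetical.

```python
def smooth_line(candidates, c):
    """Minimal-energy choice for grid points arranged on a line.

    candidates[j] is the non-empty list of candidate values at point j.
    Returns (energy, choice) where choice[j] is the 0-based index of the
    candidate selected at point j.
    """
    cost = [0] * len(candidates[0])  # best prefix energy per candidate
    back = []                        # back-pointers for reconstruction
    for j in range(1, len(candidates)):
        prev = candidates[j - 1]
        new_cost, pointers = [], []
        for y in candidates[j]:
            # Energy of the best prefix that ends in value y at point j.
            options = [cost[k] + (1 if abs(prev[k] - y) > c else 0)
                       for k in range(len(prev))]
            best_k = min(range(len(prev)), key=options.__getitem__)
            new_cost.append(options[best_k])
            pointers.append(best_k)
        cost = new_cost
        back.append(pointers)
    k = min(range(len(cost)), key=cost.__getitem__)
    energy, choice = cost[k], [k]
    for pointers in reversed(back):
        k = pointers[k]
        choice.append(k)
    choice.reverse()
    return energy, choice
```

The running time is proportional to the sum over consecutive points of the product of their candidate counts, which is polynomial in the input size.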

For the spin glass model, certain cases have been proved to be NP-hard, 
while other cases are solvable in polynomial time (see [4]), including the simple 
two-dimensional case. Although the present problem seems to be very similar, 
its structure is quite different, and so is the proof of its complexity. 



2.1 Variable Alphabet Coding 

The direct encoding of a map is just a vector with $n$ components, $c = (c_1, \dots, c_n)$, 
each component $c_j$ defining the candidate chosen at the grid point $x_j$. In terms 
of Genetic Algorithms, each $x_j$ corresponds to one gene $c_j$ of the chromosome $c$, 
and we have 

$$c = \bigotimes_{j=1}^{n} c_j \in \bigotimes_{j=1}^{n} \{1, \dots, n_j\}.$$

If this variable alphabet coding is used for Genetic Algorithms, we observe that 
the conditions and hence the assertions of the schema theorem are not at all 
affected (cf. [5], [6]). 
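
A minimal sketch of the variable alphabet coding is given below. It assumes 0-based gene values (0 to n_j - 1 instead of 1 to n_j), and the operator names are hypothetical; the point is only that each gene draws its values from its own alphabet, so standard operators carry over unchanged.

```python
import random


def random_chromosome(n_candidates, rng):
    """One gene per grid point; gene j takes values in {0, ..., n_j - 1}."""
    return [rng.randrange(nj) for nj in n_candidates]


def uniform_crossover(a, b, rng):
    """Each gene is inherited from either parent with probability 1/2."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(chrom, n_candidates, p_mut, rng):
    """With probability p_mut, redraw a gene uniformly from its own alphabet."""
    return [rng.randrange(nj) if rng.random() < p_mut else gene
            for gene, nj in zip(chrom, n_candidates)]
```

Because crossover only exchanges genes at the same position, a child of two valid chromosomes is always valid, whatever the alphabet sizes are.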

Since the genes correspond to grid points, a simple 2-dimensional arrange- 
ment of the genes resulting in a 2-dimensional chromosome would be natural. 
Appropriate crossover operators have been suggested e.g. in [7]. On the other 
hand, a conventional one-dimensional chromosome can be considered as a refer- 
ence. 

In order to show the benefits of the direct encoding, we will additionally try a 
standard binary encoding as well as a unary encoding. (Here, "unary encoding" 
means bit-counting, i.e. a sufficiently large part of the bit string is reserved for 
a grid point and the number of ones in this part defines the candidate chosen.) 
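
The bit-counting idea can be sketched as a decoder from a bit string to the direct encoding. The segment length is an assumption of this sketch: n_j - 1 bits are reserved for grid point j, so that the number of ones ranges over exactly the n_j candidate indices (0-based). The name `decode_unary` is hypothetical.

```python
def decode_unary(bits, n_candidates):
    """Decode a bit list into 0-based candidate indices by counting ones.

    Grid point j owns a segment of n_j - 1 consecutive bits (assumed
    layout); the count of ones in that segment is the chosen candidate.
    """
    chrom, pos = [], 0
    for nj in n_candidates:
        segment = bits[pos:pos + nj - 1]
        chrom.append(sum(segment))
        pos += nj - 1
    return chrom
```

Note that several bit strings decode to the same candidate, so this representation is redundant, unlike the direct encoding.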



2.2 A Local Search Operator 

Because of its 2-dimensional local structure, the problem admits a very canonical 
and powerful local search heuristic. 
Fig. 2. Construction of a spiral line test instance 

Fig. 3. Fitness curves for the spiral line test instance 

Algorithm. 

Repeat N times 
    Choose j ∈ {1, ..., n} randomly 
    Find all k ∈ {1, ..., n_j} such that the resulting choice has minimal energy 
    Choose randomly one of these k 
End Repeat 

In order to obtain reference solutions, a good choice for the number of loops is 
N = 10 · P · Q, where P and Q are the grid dimensions. On the other hand, we 
will use the heuristic as the mutation operator for the GA, thus obtaining a Memetic 
Algorithm (hybrid GA). In this case, N = P · Q is sufficient. 
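
The heuristic above can be sketched in Python as follows. The sketch assumes a 4-neighbourhood and 0-based candidate indices, and the data layout (`cand[p][q]` as the candidate list of grid point (p, q), `choice[p][q]` as the current index) is hypothetical. Since changing one grid point only affects its own connections, minimizing the local energy is equivalent to minimizing the total energy of the resulting choice.

```python
import random


def local_energy(cand, choice, p, q, k, c):
    """Energy of the connections at (p, q) if candidate k were chosen there."""
    P, Q = len(cand), len(cand[0])
    y = cand[p][q][k]
    energy = 0
    for dp, dq in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, s = p + dp, q + dq
        if 0 <= r < P and 0 <= s < Q:
            if abs(y - cand[r][s][choice[r][s]]) > c:
                energy += 1
    return energy


def local_search(cand, choice, c, n_steps, rng):
    """Repeat n_steps times: pick a random grid point and re-select, at
    random, one of the candidates with minimal energy at that point."""
    P, Q = len(cand), len(cand[0])
    for _ in range(n_steps):
        p, q = rng.randrange(P), rng.randrange(Q)
        energies = [local_energy(cand, choice, p, q, k, c)
                    for k in range(len(cand[p][q]))]
        best = min(energies)
        choice[p][q] = rng.choice(
            [k for k, e in enumerate(energies) if e == best])
    return choice
```

Each step costs a number of comparisons proportional to the candidate count at the selected point, which is what makes the operator cheap for this simple energy definition.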

2.3 Test Instances 

We will use three types of test instances of the abstract problem SmoothMap. 

1. Random test instances. 

For each grid point j, both the number of candidates and the candidate values 
are randomly generated. This is the easiest test instance type to construct; the test 
instances are almost surely nondegenerate. On the other hand, the global 
optimum is unknown and intractable to compute, so one cannot check whether an 
algorithm succeeds in finding it. 

2. Line test instances. 

This test instance type permits the global optimum to be computed. One starts 
by placing a line onto the grid such that the line does not touch itself, i.e. 
neighbouring grid points that belong to the line are neighbouring in the line. 
In Fig. 2, the line is spiral-shaped; other forms are possible as well, e.g. a 
meander. For the line points, the number of candidates and the candidate values 
are randomly generated. Then the optimal choice for these points is calculated 
with the efficient one-dimensional case algorithm (see [3]). Finally the 
remaining grid points (the "separating" points) are randomly endowed with 
candidates, where clearly one has to ensure that the optimal choice for the line 
points does not change. The resulting test instance has a global optimum that 
is easy to compute only if the line is exploited. Moreover, it is almost surely 
nondegenerate. 

Fig. 4. Fitness curves for the random test instance 

Fig. 5. Fitness curves for the noisy plane test instance 

3. Noisy plane test instances. 

A very simple way to produce test instances with known global optimum 0 is 
the following: Start with one candidate for each grid point and arrange these 
candidates such that the respective neighbouring differences are small, i.e. they 
form a noisy plane with energy 0. Then add more randomly placed candidates 
for each grid point, clearly the optimum remains 0. Placing these additional 
candidates far away from the noisy plane probably raises the optimization 
difficulty for the local heuristic. The resulting test instances are of course 
degenerate, and the optimum is easy to find if the existence of the noisy plane 
is known. 
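
A generator for noisy plane instances in the spirit of item 3 might look as follows. All specifics are assumptions of this sketch: the plane is taken to be horizontal with base values drawn from [0, c], so that any two base values differ by at most c, and the parameters `max_extra` and `spread` are hypothetical.

```python
import random


def noisy_plane_instance(P, Q, c, max_extra, spread, rng):
    """Return (cand, plane): candidate lists and the index of the hidden
    plane candidate at each grid point. Choosing plane[p][q] everywhere
    gives a map of energy 0 for threshold c."""
    cand, plane = [], []
    for _ in range(P):
        cand_row, plane_row = [], []
        for _ in range(Q):
            base = rng.uniform(0.0, c)  # noisy plane value
            extras = [rng.uniform(-spread, spread)
                      for _ in range(rng.randint(0, max_extra))]
            values = [base] + extras
            rng.shuffle(values)  # hide the plane candidate
            cand_row.append(values)
            plane_row.append(values.index(base))
        cand.append(cand_row)
        plane.append(plane_row)
    return cand, plane
```

A large `spread` places the additional candidates far away from the plane, which corresponds to the harder variant described in the text.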



2.4 Experimental Results 

The experiments have been performed with a spiral line instance, a random 
instance and a noisy plane instance. All three of them are defined on a 25 x 25 grid 
and contain from 3 to 7 candidates per grid point. These relatively large numbers 
were chosen in order to obtain difficult test instances and thus a good observation 
of the convergence speed. The following representations and crossover operators 
are compared: 

1. direct encoding with uniform crossover, 

2. binary coding with uniform crossover, 

3. unary coding ("bit counting") with uniform crossover, 

4. direct encoding with 2-point crossover, 

5. 2-dimensional direct encoding with 2-point crossover. 

In order to obtain an undistorted measure of the convergence properties of the 
different representations, we use the pure GA. Additionally, a Memetic Algo- 
rithm and the pure local heuristic have been executed. 

All code has been implemented in MATLAB. The following GA parameters 
were used: five parallel populations, each of size $\mu = 200$ (one population of 
size $\mu = 100$ for the hybrid GA), number of generations $t_{\max} = 500$, tournament 
selection with $q = 5$, crossover and mutation probabilities $p_{\text{cross}} = 0.6$ and 
$p_{\text{mut}} = 1/625$ (for the hybrid GA the local search is performed with probability 
$p_{\text{mut}} = 0.01$). For each setting, 30 runs have been executed. The fitness curves in the plots 
(Figs. 3-5) display the average fitness values over the generations, except for 
the pure heuristic runs, where the minimum value is shown for reference. 
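
For readers unfamiliar with tournament selection, a minimal sketch for a minimization problem is given below. Whether the contestants are drawn with or without replacement is not stated in the text; drawing without replacement is an assumption here, and the function name is hypothetical.

```python
import random


def tournament_select(population, fitness, q, rng):
    """Draw q distinct individuals at random and return the one with the
    lowest fitness value (energy is minimized)."""
    contestants = rng.sample(range(len(population)), q)
    winner = min(contestants, key=lambda i: fitness[i])
    return population[winner]
```

With q = 5, as in the parameter list above, the selection pressure is moderate: the best of five randomly drawn individuals is passed on to recombination.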

The Memetic Algorithm converges much faster than any of the pure GAs for 
all test data sets, even with fewer function evaluations per generation. This was 
not unexpected because of the strong local structure of the problem. However, 
the pure GA reaches the solution quality of the Memetic Algorithm after a 
sufficient number of generations, since the combination of conventional mutation 
and selection can simulate the local heuristic. 

The direct encoding performs considerably better than both the binary and 
the unary coding. This holds not only for the average values: the best binary 
or unary results are even worse than the worst variable alphabet results after 
generation 35 for all three test data sets. The 2-point crossover reduces the 
convergence speed for the 1-dimensional encoding, while in combination with 
the 2-dimensional natural encoding the performance improves a little. 

The average running time (on a Pentium III, 500 MHz) is about 24 min 
for the variable alphabet GA, 89 min for the binary and 57 min for the unary 
coded GA. This is mainly due to the fact that the fitness function is relatively 
cheap, while the decoding of the binary or unary bit strings takes some time. 

The plots for the first two data sets look very similar. This indicates that the 
line test instance is good in the sense that its hardness is similar to the random 
instance, if the line information is not exploited. However, the global optimum 
700 was found 7 times by the Memetic Algorithm, while for the random instance 
the best ever found value 414 was obtained only once. This suggests that the 
random instance is still harder to optimize. The last test instance is apparently 
easier for the local heuristic. However, even for this data set, the pure heuristic 
fails to find the global optimum in any run. 

3 The Application 

After having investigated the abstract problem, we turn to the application. That 
is, we are given a set of operating points and candidates and try to find a candidate 
choice that results in m simultaneously smooth maps. 

As already mentioned, the operating range is spanned by the engine speed 
and the relative air mass flow. We will consider m = 2 maps here, one for the 
inlet valve spread and one for the exhaust valve spread. Both valve spreads are 
mechanically adjusted actuators. Unfortunately, the operating points needed for 
the lookup tables are no longer identical to the operating points at which the 
candidates are available (see Fig. 6). Moreover, for the different candidates at 
each operating point, the location in the operating range varies slightly. That is, two 
different candidates for one operating point have not only different exhaust valve 
spread values, but also slightly different engine speed and air mass values, as can 
be observed in Fig. 6. 

Fig. 6. The application: Operating range, operating points, grid (i.e. ECU 
operating points) and some exhaust valve spread candidates 

Fig. 7. Fitness curves for the application 

As a consequence, when a selection of candidates is given, one has to apply 
appropriate interpolation and extrapolation algorithms in order to obtain values 
for the grid points. After that, one can apply the fitness function for evaluating 
the map. Here the next question arises: which is the right smoothness criterion? 
Clearly there are many possibilities. We will consider a very simple criterion by 
integrating the squared gradients of the two maps: 

$$E = \int_{\text{operating range}} \|\nabla M_{\text{inlet}}\|^2 + \int_{\text{operating range}} \|\nabla M_{\text{exhaust}}\|^2.$$
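
On a regular grid, such a criterion can be approximated with finite differences. The following sketch uses forward differences on each grid cell; the discretization, the grid spacings `dx` and `dy`, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def squared_gradient_energy(grid, dx=1.0, dy=1.0):
    """Forward-difference approximation of the integral of the squared
    gradient norm of a map sampled on a regular P x Q grid."""
    P, Q = len(grid), len(grid[0])
    total = 0.0
    for p in range(P - 1):
        for q in range(Q - 1):
            gx = (grid[p + 1][q] - grid[p][q]) / dx
            gy = (grid[p][q + 1] - grid[p][q]) / dy
            total += (gx * gx + gy * gy) * dx * dy  # cell contribution
    return total


def aggregated_energy(inlet, exhaust, dx=1.0, dy=1.0):
    """Sum of the energies of the two maps (simple aggregation)."""
    return (squared_gradient_energy(inlet, dx, dy)
            + squared_gradient_energy(exhaust, dx, dy))
```

For example, the plane M(p, q) = 2p + q on a 3 by 3 grid with unit spacing contributes 2² + 1² = 5 per cell, i.e. 20 over the four cells.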



Even with this simple formula, the local structure of the energy function 
becomes unclear with respect to the operating points because of the interpolation 
and extrapolation process. Thus a local search operator is much less efficient than 
before, since it cannot save computation time by directly exploiting the local 
structure. Instead it has to use complete evaluations of the energy function, 
which is quite expensive. 

In terms of multi-objective optimization, we thus use a simple aggregation 
method. Clearly, other more sophisticated multi-objective techniques can be 
expected to produce further interesting results, in particular for a growing number 
of maps m. This should be the subject of subsequent research. 
We present the experimental results for the data set sketched in Fig. 6. There 
are 55 operating points endowed with candidates, the maximum number of 
candidates for one operating point is 4, and the grid has dimension 12 × 12. This is a 
relatively small data set from the real-world application. Again, we compare variable 
alphabet coding with uniform crossover, 2-point crossover and 2-dimensional 
2-point crossover, as well as unary and binary coding with uniform crossover. Since 
the operating points do not form a grid this time, we used the grid construction 
algorithm from [8] in order to obtain the 2-dimensional arrangement. 

The following GA parameters were used: population size $\mu = 100$, number 
of generations $t_{\max} = 100$, tournament selection with $q = 5$, crossover and 
mutation probabilities $p_{\text{cross}} = 0.6$ and $p_{\text{mut}} = 1/55$. Again, 30 runs have been 
performed for each setting. We did not use a Memetic Algorithm, since the local 
search is quite expensive in this case: one call needs more function evaluations 
than a whole GA generation, and the local search cannot efficiently exploit the 
local structure of the problem, as discussed above. Nevertheless we include the 
results of the pure local search. 

Figure 7 shows the average GA fitness curves. This smaller instance is 
obviously easier to optimize than the 25 × 25 test instances. Again, direct encoding 
performs considerably better than both unary and binary coding. This time the 
best binary or unary results are worse than the worst variable alphabet results 
after generation 37. The optimal value is found by the variable alphabet GA 
with any crossover operator in all of the 30 runs. Hence we assume this value to 
be the global optimum. The binary coded GA obtains it in 27, the unary coded 
GA in 20 out of 30 runs. The pure local search finds the optimum in 20 out of 30 
runs. This seems quite good, but the heuristic does not scale up well: it fails 
to optimize larger instances. It also needs a large number of function evaluations, as 
already mentioned. 

This time, the plot indicates that uniform crossover performs slightly better 
than the 2-dimensional 2-point crossover. This may be due to the fact that the 2- 
dimensional structure of the operating points is weaker than before. The average 
running time of the GA is about 7 min, independently of the representation. This 
is a consequence of the quite complex fitness function, where the decoding time 
becomes negligible. 



4 Discussion 

Our study covers two different aspects. On the one hand, the GA is able to 
optimize the application we started with in a satisfactory way. The quality of the 
resulting maps is similar to or even better than that of the maps that were 
previously obtained manually by an engineer, a process which took several hours. On 
the other hand, we presented an easy-to-state problem with interesting features. 
We studied the performance of different codings and showed that a direct 
encoding is more suitable for a GA. We also tried a 2-dimensional encoding and saw 
that it can slightly accelerate the GA performance. This confirms prior results 
of different researchers, who find that a d-dimensional encoding can result in a 
moderate, but not significant, performance gain. Finally, hybridization of a GA 
with a local search yields superior results if the local search can be performed 
efficiently. 
Abstract. 3-SAT is a canonical NP-complete problem: satisfiable and 
unsatisfiable instances cannot generally be distinguished in polynomial 
time. However, random 3-SAT formulas show a phase transition: for any 
large number of variables n, sparse random formulas (with m ≤ 3.145n 
clauses) are almost always satisfiable, dense ones (with m ≥ 4.596n 
clauses) are almost always unsatisfiable, and the transition occurs 
sharply when m/n crosses some threshold. It is believed that the 
limiting threshold is around 4.2, but it is not even known that a limit 
exists. Proofs of the satisfiability of sparse instances have come from 
analyzing heuristics: the better the heuristic analyzed, the denser the 
instances that can be proved satisfiable with high probability. To date, 
the good heuristics have all been extensions of unit-clause resolution, 
all expressible within a common framework and analyzable in a uniform 
manner through the differential equation method. Any algorithm 
expressible in this framework can be "tuned" optimally. This tuning 
requires extending the analysis via the differential equation method, and 
making use of a "maximum-density multiple-choice knapsack" problem. 
The structure of optimal knapsack solutions elegantly characterizes the 
choices made by an optimized algorithm. Optimized algorithms result 
in improving the known satisfiability bound from density 3.145 to 3.26. 
Many open problems remain. It is non-trivial to extend the methods to 
4-SAT and beyond. If results are to be applicable to "real-world" 3-SAT 
instances, then the theory should be extended to formulas that need 
not be uniformly random, but obey some weaker conditions. Also, there 
is theoretical evidence that in the unsatisfiable regime it is difficult to 
prove the unsatisfiability of a given formula, while in the known region 
of satisfiability, linear-time algorithms produce satisfying assignments 
with high probability. Is the unsatisfiable regime truly hard, and is the 
whole of the satisfiable regime truly easy? In particular, as the scope of 
myopic, local algorithms is expanded so that they examine more and 
more variables, can such algorithms solve random instances arbitrarily 
close to the threshold density? 

Keywords: 2-SAT; 3-SAT; phase transition; complexity; satisfiability; 
randomized analysis of algorithms; average-case analysis; differential 
equation method. 
1 Introduction 

Random 3-SAT formulas show a fascinating “phase transition” as a function of 
their “density” - the ratio of the number of clauses to the number of variables. 
A random formula with density less than 3.145 is almost always satisfiable, 
while one with density greater than 4.596 is almost always unsatisfiable. It is 
conjectured that the transition occurs around 4.2, but it is not even known that 
all formulas with many variables have a single, common density threshold. 

To make this precise, let $F_k(n,m)$ be a random $k$-SAT formula formed by 
selecting uniformly, independently, and with replacement, $m$ clauses among 
all non-trivial clauses of length $k$, i.e., clauses with $k$ distinct, 
non-complementary literals.¹ Let us say that a sequence of events $\mathcal{E}_n$ holds with 
high probability (w.h.p.) if $\lim_{n\to\infty} \Pr[\mathcal{E}_n] = 1$ and with positive probability if 
$\liminf_{n\to\infty} \Pr[\mathcal{E}_n] > 0$. The following was first put forward in [7] and has 
become a folklore conjecture. 
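
The model $F_k(n,m)$ is easy to sample. The sketch below is an illustration under stated assumptions: variables are numbered 1 to n, a literal is a signed integer in the DIMACS style, and a clause is treated as an unordered set of literals even though it is stored as a tuple. The function names are hypothetical.

```python
import random


def random_clause(n, k, rng):
    """A uniformly random non-trivial clause: k distinct variables, each
    negated independently with probability 1/2."""
    variables = rng.sample(range(1, n + 1), k)
    return tuple(v if rng.random() < 0.5 else -v for v in variables)


def random_formula(n, m, k, rng):
    """m clauses drawn independently and with replacement."""
    return [random_clause(n, k, rng) for _ in range(m)]
```

Choosing distinct variables guarantees that the literals of a clause are distinct and non-complementary, and every such clause is equally likely.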

Satisfiability Threshold Conjecture. For every $k \ge 2$, there exists a constant 
$r_k$ such that for all $\epsilon > 0$, 

- If $m \le (r_k - \epsilon)n$ then w.h.p. $F_k(n,m)$ is satisfiable. 

- If $m \ge (r_k + \epsilon)n$ then w.h.p. $F_k(n,m)$ is unsatisfiable. 

The satisfiability threshold conjecture has attracted attention in computer 
science, mathematics, and more recently in mathematical physics [16,17,18]. 
Below we briefly review the current state of knowledge. 

For the linear-time solvable $k = 2$ case, the conjecture was settled 
independently, with $r_2 = 1$, by Chvátal and Reed [7], Goerdt [12], and Fernandez de la 
Vega [9]. Further, Bollobás et al. [3] recently determined the problem's "scaling 
window": with $m = r_2 n$ and $r_2 = 1 + \lambda n^{-1/3}$, any desired probability $p$ of 
satisfiability can be achieved by an appropriate setting of $\lambda = \lambda(p)$ (independent 
of $n$, asymptotically). 

For $k \ge 3$, it is not even known whether there exists a single threshold value 
$r_k$ valid for all (asymptotically large) $n$. By contrast, the existence of a threshold 
sequence, $r_k(n)$, is proved by a theorem of Friedgut [10]. 

Theorem 1 ([10]). For every $k \ge 2$, there exists a sequence $r_k(n)$ such that 
for all $\epsilon > 0$, 

- If $m \le (r_k(n) - \epsilon)n$ then w.h.p. $F_k(n,m)$ is satisfiable. 

- If $m \ge (r_k(n) + \epsilon)n$ then w.h.p. $F_k(n,m)$ is unsatisfiable. 

An immediate corollary of Theorem 1 is that if $F_k(n, rn)$ is satisfiable with 
positive probability then, for all $\epsilon > 0$, $F_k(n, (r - \epsilon)n)$ is satisfiable w.h.p. Of course 
it remains open whether $r_k(n)$ converges; otherwise the satisfiability threshold 
conjecture would be resolved. 

¹ While we adopt the $F_k(n,m)$ model throughout the paper, all the results presented 
hold in all standard models for random $k$-SAT, e.g. when clause replacement is not 
allowed and/or when each $k$-clause is formed by selecting $k$ literals uniformly at 
random with replacement. 
Existing proofs that dense formulas are typically unsatisfiable all rely on 
probabilistic counting arguments. The simplest of these forms a small part of 
Chvátal and Szemerédi [6]. The probability that any truth assignment satisfies 
a random clause is 7/8, the probability that it satisfies all $m$ clauses in a 
random formula is $(7/8)^m$, and so the expected number of assignments satisfying 
a random formula is $2^n (7/8)^m$; for $m/n > \ln(2)/\ln(8/7) \approx 5.19$, the expected 
number of satisfying assignments (and the probability that there is any satisfying 
assignment) is small. After a series of improvements in the bound, the current 
best, due to Janson, Stamatiou and Vamvakari [13], is 4.596. 
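
The first-moment calculation above is easy to check numerically. The sketch below generalizes it to $k$-SAT, where a random clause is satisfied by a fixed assignment with probability $1 - 2^{-k}$; the function names are hypothetical.

```python
import math


def first_moment_threshold(k=3):
    """Density m/n above which the expected number of satisfying
    assignments 2^n (1 - 2^-k)^m tends to zero."""
    return math.log(2) / math.log(2 ** k / (2 ** k - 1))


def log_expected_satisfying(n, m, k=3):
    """Natural logarithm of 2^n (1 - 2^-k)^m."""
    return n * math.log(2) + m * math.log(1 - 2.0 ** (-k))
```

For k = 3 the threshold evaluates to roughly 5.19, matching the value quoted above; with n = 100, the logarithm of the expectation is positive for m = 500 and negative for m = 600.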

The greater part of Chvátal and Szemerédi [6] is to show that even random 
unsatisfiable formulas seem to be computationally intractable. Specifically, they 
show that for many unsatisfiable formulas (a polynomial fraction of them), every 
"resolution" proof of unsatisfiability is exponentially long, never mind the 
difficulty of finding it. This does not rule out the possibility of short unsatisfiability 
proofs of some other sort, but there has been limited progress in this regard. 
Friedman and Goerdt [11] show that random 3-SAT formulas with $n^{3/2+\epsilon}$ 
clauses can be certified unsatisfiable in polynomial time using spectral 
methods, but no such method is known for formulas of any linear size. This suggests 
that unsatisfiability of even random, linear-sized formulas is genuinely hard, like 
general (un)satisfiability. 

By contrast, although 3-SAT is NP-complete, satisfiability for random formulas
below threshold appears to be easy. In fact, proofs that sufficiently sparse
formulas are satisfiable have all come from analyzing algorithms that satisfy them
with positive or high probability, and these algorithms all run in linear time. The
first high-probability heuristic was given by Broder, Frieze and Upfal [4] who
proved that w.h.p. the pure literal heuristic succeeds on F_3(n, rn) for r ≤ 1.63,
but fails for r ≥ 1.7. This work was predated by Chao and Franco [5], who
showed that a certain heuristic succeeds with positive probability for r < 2.9;
unfortunately this had to await [10] before it could be interpreted as a
satisfiability threshold bound.

2 Random 3-SAT below the Threshold 

The more powerful the algorithm analyzed, the denser the formulas that can be 
proved satisfiable with high probability. The best algorithms have all been of a 
type Achlioptas and Sorkin [2] call “myopic”, and they show how to optimally 
tune any of them. The technical details are supplied in [2], so the following is a 
still detailed but more conversational presentation. 

In the course of the algorithm, the random formula F will repeatedly be
“reduced”: when a literal A is set to True (we will write A = 1), any clause in
which A appears is satisfied, and may be discarded from F, while any in which
Ā appears fails to be satisfied by Ā itself, and so may be shortened by discarding
Ā from it (a 3-clause becomes a 2-clause, a 2-clause becomes a 1-clause, and a
1-clause becomes a 0-clause, meaning that F is unsatisfied). Letting m1, m2,
and m3 denote the number of “unit” (1-variable) clauses, 2-clauses, and 3-clauses
in F, and n as before the number of variables, we define in the obvious way a
random formula F ∈ F(n; m1, m2, m3).
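
The reduction step described above can be sketched as follows. This is an illustration under assumed conventions, not the authors' code: a literal is a nonzero integer (+v for variable v, -v for its negation), a clause is a tuple of literals, and the names `reduce_formula` and `clause_counts` are hypothetical.

```python
def reduce_formula(clauses, literal):
    """Set `literal` to True and return the reduced list of clauses."""
    reduced = []
    for clause in clauses:
        if literal in clause:
            continue  # clause is satisfied: discard it
        # Drop the falsified literal (the negation), if present.
        reduced.append(tuple(l for l in clause if l != -literal))
    return reduced

def clause_counts(clauses):
    """Return {size: number of clauses of that size} for sizes 0..3."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for clause in clauses:
        counts[len(clause)] += 1
    return counts

if __name__ == "__main__":
    formula = [(1, 2, 3), (-1, 2, 4), (-1, -2), (3, 4, 5)]
    after = reduce_formula(formula, 1)  # set variable 1 to True
    print(after, clause_counts(after))
```

A clause of size 0 in the result signals that the partial assignment has falsified the formula.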



2.1 A Myopic Algorithm 

Rather than setting forth a formal definition, consider the following example of
a “1-variable” myopic algorithm.

One round:

- Free move:

  • Choose a random 2-clause literal, A.

  • Expose all occurrences of A and Ā, giving
    counts A = (A_2^+, A_2^-, A_3^+, A_3^-).

  • Set A = σ(A). (1 = true, 0 = false.)

  • Reduce F.

- Forced moves:

  • While F has any unit clauses, choose one at random, set its literal
    B = 1, and reduce F.

That is, first, a random literal A is chosen from a 2-variable clause.^
By “revealing” all occurrences of A and Ā, we mean identifying (or equivalently,
choosing) the clauses in which they appear and thereby conditioning
the random formula F, but leaving F otherwise unconditioned. The counts
A = (A_2^+, A_2^-, A_3^+, A_3^-) simply denote the number of occurrences of A in
2-clauses (at least 1, by definition), the number of occurrences of Ā in 2-clauses,
and similarly the number of occurrences of A and Ā in 3-clauses. The assignment
of True or False to A is made as a function σ of A, A = σ(A). Choosing the
“policy” σ : A → {0, 1} optimally is exactly what we mean by the optimal tuning
of this myopic algorithm. In addition to its explicit argument, σ(A) is also
permitted to depend on the 2- and 3-clause densities p2 = m2/n and p3 = m3/n;
we may write A = σ(n, p2, p3, A) to be more explicit. There is no role for m1
because in the free move of each round, m1 = 0.

If at any stage F contains an empty clause, then it is unsatisfied by the 
constructed assignment; the algorithm terminates in failure. Otherwise, rounds 
are made until F is empty; the empty formula is satisfied, and the original 
formula is satisfied by the constructed assignment. Variables set by the algorithm 
are not reversed (except in a slight variant of the standard algorithm, discussed
soon).
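
The round structure above can be sketched in code. This is only an illustration: the names are hypothetical, `majority_policy` is a simple stand-in and not the optimized policy of [2], and when no 2-clause exists the sketch picks a literal from any clause rather than adding a random 2-clause as the footnote suggests. Literals are nonzero integers (+v / -v) and clauses are tuples over distinct variables.

```python
import random

def reduce_formula(clauses, literal):
    """Set `literal` True: drop satisfied clauses, shorten those with its negation."""
    return [tuple(l for l in c if l != -literal) for c in clauses if literal not in c]

def occurrence_counts(clauses, literal):
    """Return (A2+, A2-, A3+, A3-): occurrences of the literal and its negation."""
    counts = {(2, True): 0, (2, False): 0, (3, True): 0, (3, False): 0}
    for c in clauses:
        if len(c) in (2, 3):
            if literal in c:
                counts[(len(c), True)] += 1
            elif -literal in c:
                counts[(len(c), False)] += 1
    return (counts[(2, True)], counts[(2, False)],
            counts[(3, True)], counts[(3, False)])

def majority_policy(counts):
    """Hypothetical policy: set A = 1 iff A occurs at least as often as its negation."""
    a2p, a2m, a3p, a3m = counts
    return 1 if a2p + a3p >= a2m + a3m else 0

def myopic(clauses, policy, rng):
    """Run rounds until F is empty (return assignment) or a 0-clause appears (None)."""
    clauses, assignment = list(clauses), {}
    while clauses:
        # Free move: pick a literal from a random 2-clause (fallback: any clause).
        pool = [c for c in clauses if len(c) == 2] or clauses
        a = rng.choice(rng.choice(pool))
        chosen = a if policy(occurrence_counts(clauses, a)) == 1 else -a
        assignment[abs(chosen)] = chosen > 0
        clauses = reduce_formula(clauses, chosen)
        # Forced moves: satisfy unit clauses until none remain.
        while True:
            if any(len(c) == 0 for c in clauses):
                return None  # empty clause: the algorithm fails
            units = [c for c in clauses if len(c) == 1]
            if not units:
                break
            b = rng.choice(units)[0]
            assignment[abs(b)] = b > 0
            clauses = reduce_formula(clauses, b)
    return assignment

if __name__ == "__main__":
    print(myopic([(1, 2, 3), (-1, 2), (-2, 3)], majority_policy, random.Random(0)))
```

Variables missing from the returned assignment were never touched and may be set arbitrarily; a result of None means only that this run failed, not that the formula is unsatisfiable.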



^ Initially, of course, F has only 3-clauses, no 2-clauses. This is a minor matter; for 
our purposes it suffices, for example, simply to add a random 2-clause to F if it has 
none. But after the beginning, it will have plenty, so this is a non-issue. 
2.2 Issues in the Algorithm’s Analysis

There are several obvious questions. What can be said about the intermediate
formulas F' derived as the algorithm runs? Why is σ allowed to depend on p2 and
p3, but not on all of n, m2, and m3? Why is the setting of A = σ(A) not trivial:
set A so as to satisfy as many clauses as possible? What is the algorithm’s
probability of succeeding? What does it mean to optimize the algorithm, and
what choice of policy function σ achieves this?

2.3 Intermediate Formulas 

First, if any intermediate formula F' in the algorithm’s execution is satisfiable, 
then the original formula F is also satisfiable: Set each variable in F that is 
also present in F' the way it is set to satisfy F' , and set the variables in F 
missing from F' the way they were set during the algorithm’s execution. If F' is 
unsatisfiable, then the algorithm has failed; F may or may not be satisfiable. 

The intermediate formulas produced as rounds continue are themselves ran- 
dom, as per the following lemma. 

Lemma 1. Let F ∈ F(n; m2, m3) be a random formula. Apply one round of
any myopic algorithm to F, producing formula F'. Conditional upon the round
setting n0 variables, and F' having m2' and m3' 2- and 3-clauses, F' is a random
formula in F(n - n0; m2', m3').

This is a simple fact: all that is known about the literals in F' is that they are
not any of the literals that have been revealed.

2.4 Success and Failure 

Next, what about success or failure of the algorithm? In a round, the algorithm
will fail only if two contradictory unit clauses are generated, B and B̄. But w.h.p.
the number of literal occurrences exposed during any move is O(1); also, as
each unit clause is resolved, the expected number of new unit clauses generated
is p2, so as long as the 2-clause density p2 < 1, w.h.p. the number of literal
occurrences exposed in a round is O(1). Thus, as long as p2 < 1, w.h.p. the
total number of unit clauses generated in a round is O(1), and so the probability
that any two are in contradiction is O(1/n). That is, each round succeeds with
probability 1 - O(1/n); the rounds are all independent; and so the algorithm’s
success probability over all O(n) rounds is Θ(1). If ever the 2-clause density
rises above 1, then it follows from the unsatisfiability of random 2-SAT with
such density that the current working formula F is w.h.p. unsatisfiable, and so
w.h.p. the algorithm must fail. The way it does so is that if p2 > 1, then as each
unit clause is satisfied, it creates on average more than one new unit clause, the
“branching process” explodes, many unit clauses are generated within a single
round, and it is overwhelmingly likely that two of them contradict one another.
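
The branching-process picture can be explored with a small simulation. This is a sketch under a modelling assumption that is not stated in the text: each resolved unit clause is taken to create a Poisson(p2) number of new unit clauses (the text only gives the mean p2). Under that assumption the expected cascade size is 1/(1 - p2) for p2 < 1. All names are illustrative.

```python
import math
import random

def poisson(rng, lam):
    """Sample Poisson(lam) by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def units_in_round(rng, p2, cap=2000):
    """Number of unit clauses resolved in a cascade started by one unit clause.

    The cascade is cut off at `cap` so that supercritical runs terminate.
    """
    pending, total = 1, 0
    while pending and total < cap:
        pending -= 1
        total += 1
        pending += poisson(rng, p2)  # assumed offspring distribution
    return total

def mean_units(p2, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(units_in_round(rng, p2) for _ in range(trials)) / trials

if __name__ == "__main__":
    print(mean_units(0.5))              # close to 1 / (1 - 0.5)
    print(mean_units(1.5, trials=200))  # cascades frequently hit the cap
```

For p2 below 1 the cascades stay small, which is what keeps the per-round failure probability at O(1/n); above 1 a constant fraction of cascades never die out.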

So as long as p2 < 1 (and we will ensure that this is so) the algorithm
fails with only small probability in a round, and only constant probability over
all. For convenience, we may ignore these chance failures, for either of two
reasons. One reason is that Friedgut’s result says that a random formula of any
fixed density (away from the one critical value) is either satisfiable w.h.p. or
is unsatisfiable w.h.p. If our algorithm succeeds with constant probability, then
it must be that F is satisfiable w.h.p., which is what we originally wished to
show. The other reason is that we can modify the algorithm to make it succeed
with high probability, rather than merely with constant probability. This is a bit
tricky, and not necessary to the understanding of the main results.

The modification requires a more exact specification of the algorithm: after
the free move, set all unit-clause literals in this first “layer” simultaneously, then
set all newly created unit-clause literals (in the second layer) simultaneously, and
so forth, until a layer is generated which has contradictory clauses. The literals in
the last layer were all there because they appeared, in 2-clauses, with negations
of the literals in the second-last layer. Thus, reversing the settings of the literals
in the second-last layer, the literal occurrences defining the last layer all lie
in already-satisfied clauses. Similarly, reverse the settings of the literals in the
third-last layer to satisfy the clauses containing the literal occurrences defining
the second-last layer. Repeat, finally reversing the setting of the literal chosen
in the free move. With the setting of the various literals reversed, a different
set of unit clauses is generated; follow the usual procedure to resolve them, the
new unit clauses they generate, and so forth. Note that the original number
of unit clauses, and the new number, are w.h.p. both O(1), so the probability
that there is a conflict within the new unit clauses, or between the new ones
and the re-set old ones, is O(1/n). The probability that a round fails both the
normal procedure and this backup procedure is O(1/n^2); over all O(n) rounds,
the probability of any failure is O(1/n), and so the algorithm succeeds w.h.p.
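
The arithmetic behind the two success probabilities can be made concrete. In the sketch below the constant c is hypothetical (the text only says the per-round failure probability is O(1/n)), and the number of rounds is taken to be exactly n for simplicity.

```python
import math

def overall_success(per_round_failure: float, rounds: int) -> float:
    """Probability that all independent rounds succeed."""
    return (1.0 - per_round_failure) ** rounds

if __name__ == "__main__":
    n, c = 10**6, 2.0  # c is an assumed constant, for illustration only
    plain = overall_success(c / n, n)          # failure c/n per round
    backup = overall_success((c / n) ** 2, n)  # failure (c/n)^2 per round
    print(plain, math.exp(-c))  # constant success probability, about e^(-c)
    print(backup)               # 1 - O(1/n): success w.h.p.
```

With failure probability c/n per round the product tends to the constant e^(-c), while squaring the per-round failure probability pushes the overall failure probability down to about c^2/n.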

Combining the preceding remarks, we run the algorithm until either p2 ≥ 1
(in which case w.h.p. the algorithm has failed, and we know nothing about the
original formula) or until we reach a point (p2, p3) where it is known that almost
all formulas are satisfiable (in which case w.h.p. the algorithm has succeeded).
One or the other outcome will always occur. Moreover, as will be seen, for a given
initial pair of densities (p2, p3), one of the outcomes will occur almost always,
so for random formulas with given parameters, either w.h.p. we demonstrate
satisfiability, or w.h.p. we fail to demonstrate anything.

2.5 Optimizing Policies, Optimal Algorithms, and Parametrization 

Now we take up our last questions, about the optimal setting of the free-round
variable A, and the optimal choice of the policy function σ. First, we cannot
simply set A “to maximize the number of satisfied clauses” because one setting
of A might satisfy more 2-clauses, while the opposite setting might satisfy more
3-clauses, and in such a case the decision is not clear. (The intuition is correct,
that if one setting maximizes both, it should be used.) The function σ should of
course depend on (A_2^+, A_2^-, A_3^+, A_3^-), which contains all the information (revealed
randomness) about the structure of the particular random formula F. We allow
σ to depend additionally not on all of n, m2, and m3 (all the information there
is), but only on p2 and p3, because in two senses, (p2, p3) is the “correct”
parametrization.

First, with the particular functions σ(p2, p3; A_2^+, A_2^-, A_3^+, A_3^-) that will be of
interest to us, the parameters of the formulas F' produced during the execution
of the algorithm will, almost surely, almost exactly, follow a predictable trajectory
through (p2, p3)-space. (See Figure 1.) If the trajectory, which of course
depends on the initial densities (p2, p3), first passes through a region where
formulas are known to be satisfiable w.h.p. (such as p2 < 1 and p3 < 2/3) then
the original formula was satisfiable w.h.p.; if it first passes through a point with
p2 ≥ 1, then we can say nothing.




Fig. 1. Optimal gradients dp2/dp3 for “one variable at a time”, and the trajectory
starting from (3.22, 0) and stopping when p3 < 2/3. (Horizontal axis: p3.)



Second, no algorithm of the same type, but using all the available information
in a policy A = σ(n, m2, m3; A_2^+, A_2^-, A_3^+, A_3^-), can give a better (lower)
trajectory; therefore no such more general policy can allow us to reach a stronger
conclusion about satisfiability. In other words, for our purposes, a policy σ(p2, p3; A)
is optimal.

Let us argue these points in a little more detail. 



2.6 Fixed Policies 

Suppose that the policy does not depend on (p2, p3), i.e., σ(p2, p3; A) = σ(A).
Consider a round of the algorithm, when the working formula has densities
(p2, p3). In the free move beginning the round, the probability distribution over
tuples A = (A_2^+, A_2^-, A_3^+, A_3^-) can be calculated almost exactly. For example,
A_2^+ = 1 + Po(p2), where Po(λ) indicates a Poisson random variable with parameter
λ; the other variables simply have Poisson distributions (no additive constants),
and the variables are approximately independent.
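
A sampler for the count tuple illustrates this distribution. Only the law of A_2^+ is stated in the text; the Poisson parameters used below for the other three counts (p2 for A_2^-, and 3·p3/2 for A_3^+ and A_3^-) are an assumption obtained by dividing the number of literal slots by the 2n literals, and exact independence is assumed for simplicity. All names are illustrative.

```python
import math
import random

def poisson(rng, lam):
    """Sample Poisson(lam) by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_counts(rng, p2, p3):
    """Draw (A2+, A2-, A3+, A3-) for a literal chosen from a random 2-clause."""
    a2_plus = 1 + poisson(rng, p2)     # stated in the text: 1 + Poisson(p2)
    a2_minus = poisson(rng, p2)        # assumed parameter: 2*m2 / (2n) = p2
    a3_plus = poisson(rng, 1.5 * p3)   # assumed parameter: 3*m3 / (2n) = 1.5*p3
    a3_minus = poisson(rng, 1.5 * p3)  # assumed parameter, as above
    return (a2_plus, a2_minus, a3_plus, a3_minus)

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_counts(rng, 0.8, 2.0) for _ in range(20000)]
    print([sum(s[i] for s in samples) / len(samples) for i in range(4)])
```

Averaging a policy's effect over such samples is one way to estimate the per-round expectations discussed next, although [2] computes them exactly.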
The tuple determines, through the policy σ, the setting A = σ(A) and in
turn the number of eliminated and reduced clauses of each size (including the
number of unit clauses formed). For each unit clause resolved (each forced move),
again, it is straightforward to compute the distribution of the number of clauses
of each size eliminated and reduced. The expected number of forced moves can
also be calculated. In short, for each tuple A, given σ(A), we can calculate
the expected changes to n, m2, and m3 in the course of the round. (Details of
these calculations can be found in [2], but they are irrelevant to the thrust of
the argument.) That is, we can calculate the expected change in the number of
variables,

E(Δn) = Σ_A Pr(A) E(Δn | A, σ(A)),

and likewise for E(Δm2) and E(Δm3). All such changes will be small, just
O(1), and so the changes to the parameters (p2, p3) will be just O(1/n).

Since the key parameters (p2, p3) change very little in a round, we are free
to repeat the analysis over a number of rounds ω, where ω = ω(1) (large), but
ω = o(n) (small enough that (p2, p3) will only change by o(1)). Over ω rounds,
the number of times any tuple A occurs in the free move is determined almost
exactly; given A, the expected changes to n, m2, and m3 in the whole round are
determined by A and σ(A); and by a law of large numbers, almost surely, the
actual change over ω rounds in the number of variables satisfies

Δ_ω n = ω E(Δn) (1 + o(1))

      = ω Σ_A Pr(A) E(Δn | A, σ(A)) (1 + o(1)),

and similarly for Δ_ω m2 and Δ_ω m3. Since p2 = m2/n,

E(Δp2) = E(Δ(m2/n)) = (1/n) E(Δm2 - p2 Δn) (1 + o(1)).   (1)

It follows that the change in p2 over ω rounds is almost surely almost exactly
equal to E(Δ_ω p2), like the changes in n, m2, and m3. Likewise, of course, for p3.
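
The step from p2 = m2/n to equation (1) is a first-order expansion of a quotient. A short derivation, written here for a generic change (Δm2, Δn) with Δn = o(n), may help; it is a reconstruction of the omitted algebra rather than a quotation from [2].

```latex
\Delta\!\left(\frac{m_2}{n}\right)
  = \frac{m_2 + \Delta m_2}{n + \Delta n} - \frac{m_2}{n}
  = \frac{n\,\Delta m_2 - m_2\,\Delta n}{n\,(n + \Delta n)}
  = \frac{1}{n}\,\bigl(\Delta m_2 - p_2\,\Delta n\bigr)\,\frac{n}{n + \Delta n}
  = \frac{1}{n}\,\bigl(\Delta m_2 - p_2\,\Delta n\bigr)\,\bigl(1 + o(1)\bigr).
```

Taking expectations gives (1); the same computation with m3 and p3 gives the corresponding statement for p3.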

Moreover, because the function Pr(A) is smooth in p2 and p3, so are the
expected changes E(Δ_ω p2) and E(Δ_ω p3): this pair defines a smooth vector field.
(Such a vector field would be similar to the line field of Figure 1, but where a line
field has only slopes at each point, a vector field has a slope and magnitude at
each point.) Thus, the differential equation method [14, 19] allows us to conclude
that there is a trajectory almost surely almost exactly followed by (p2, p3) over
many rounds, and the derivative of this trajectory matches the vector field (in
the same way that the trajectory shown in Figure 1 matches the slopes of its
line field).
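
To illustrate how a trajectory is obtained from a vector field, the sketch below integrates the expected-change equations of a different and much simpler heuristic, the classical unit-clause heuristic (set one variable per step, uniformly at random unless a unit clause forces it). This is a stand-in chosen because its equations are elementary; it is not the myopic policy of this section, whose vector field requires the calculations in [2]. Here x is the fraction of variables set, and c2, c3 are the numbers of 2- and 3-clauses per original variable; all names are illustrative.

```python
def max_two_clause_density(r, steps=100_000, x_end=0.9):
    """Euler-integrate the unit-clause heuristic's equations from density r.

    dc3/dx = -3*c3/(1-x)
    dc2/dx = 1.5*c3/(1-x) - 2*c2/(1-x)
    Returns the largest 2-clause density c2/(1-x) seen along the trajectory.
    """
    h = x_end / steps
    x, c2, c3 = 0.0, 0.0, float(r)
    best = 0.0
    for _ in range(steps):
        remaining = 1.0 - x
        dc3 = -3.0 * c3 / remaining
        dc2 = 1.5 * c3 / remaining - 2.0 * c2 / remaining
        c3 += h * dc3
        c2 += h * dc2
        x += h
        best = max(best, c2 / (1.0 - x))
    return best

if __name__ == "__main__":
    for r in (2.5, 3.0):
        # Closed form for comparison: the maximum density is 3r/8, at x = 1/2.
        print(r, max_two_clause_density(r), 3 * r / 8)
```

For this stand-in heuristic the 2-clause density stays below 1 exactly when r < 8/3, which mirrors the way the trajectory in Figure 1 is checked against the line p2 = 1.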

2.7 Optimized Policies 

Intuitively, the best choice of policy σ(A) is one giving the “shallowest” slopes
Δp2/Δp3 (i.e., since all the slopes are negative, maximizing them), for that
means that starting from a denser formula (a point further to the right on the
x axis) we may still enter the known satisfiability region.

From (1),

E(Δp2)/E(Δp3) = E(Δm2 - p2 Δn) / E(Δm3 - p3 Δn)

  = Σ_A Pr(A) [E(Δm2 | A, σ(A)) - p2 E(Δn | A, σ(A))]
    / Σ_A Pr(A) [E(Δm3 | A, σ(A)) - p3 E(Δn | A, σ(A))].   (2)

The maximization can be performed efficiently through a “maximum-density
multiple-choice knapsack” formulation, as detailed in [2].^ Roughly, replacing
“cases” A with i, and policy choices A = σ(A) with j = σ(i), (2) can be
generalized as

Σ_i x(i, σ(i)) / Σ_i y(i, σ(i)),   (3)

and the goal is, given all values x(i, j) and y(i, j), to choose the mapping σ : i → j
so as to maximize (3).
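
For a handful of cases the ratio objective can be maximized by exhaustive search, which makes the problem statement concrete. The sketch below is not the near-linear-time method of [2]; it assumes that (3) is a ratio of two sums as in (2), skips mappings whose denominator is not positive (a simplifying assumption), and uses made-up numbers.

```python
from itertools import product

def best_policy(x, y):
    """Maximize sum_i x[i][s(i)] / sum_i y[i][s(i)] over all mappings s.

    x[i][j] and y[i][j] are the numerator and denominator contributions of
    case i under choice j. Returns (best ratio, mapping as a tuple).
    """
    best_ratio, best_map = None, None
    for mapping in product(*(range(len(row)) for row in x)):
        num = sum(x[i][j] for i, j in enumerate(mapping))
        den = sum(y[i][j] for i, j in enumerate(mapping))
        if den <= 0:
            continue  # sketch assumes positive denominators
        ratio = num / den
        if best_ratio is None or ratio > best_ratio:
            best_ratio, best_map = ratio, mapping
    return best_ratio, best_map

if __name__ == "__main__":
    x = [[1, 2], [3, 1]]  # hypothetical values x(i, j)
    y = [[1, 3], [2, 1]]  # hypothetical values y(i, j)
    print(best_policy(x, y))
```

The search is exponential in the number of cases, so it is only useful for checking a faster method on small inputs.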

Returning to the SAT context, because Pr(A) depends on (p2, p3), so does the
optimized policy σ(p2, p3; A): the optimized policy is not a fixed policy. This
must be taken into account in our analysis, which extends over many rounds,
during the course of which (p2, p3) changes, and thus σ changes. In fact, this
is a real issue. For any two points (p2, p3) and (p2', p3'), however close, there
is at least some “case” A for which the optimized policies differ; when this
case occurs, the two policies will yield different values E(Δn | A) and so forth. Since
any case A occurs with positive probability, even at arbitrarily nearby points,
the two optimized policies will produce different values E(Δp2) and E(Δp3) for
the expected changes over a single round. That is to say, if in Figure 1 the
optimized policy’s line field E(Δp2)/E(Δp3) were replaced by its vector field
(E(Δp2), E(Δp3)), that vector field would be almost everywhere discontinuous.

Fortunately, it is in the nature of the optimized policy that the slopes
E(Δp2)/E(Δp3) change smoothly. Were the optimized slope much better at
(p2, p3) than at a nearby point (p2', p3'), the optimized (p2, p3) policy used at
(p2', p3') would give essentially the same slope as it did at (p2, p3) (fixed policies
have smooth line fields), contradicting the supposed optimization of the
different policy at (p2', p3'). That is, while an optimized policy’s vector field is
discontinuous, its line field is smooth. It follows that the optimized policy is
again susceptible to the same sort of analysis as a fixed policy, with a slight
extension to the differential equation method to work with the line field and not
the vector field.

^ In principle, since this maximization is just used for the algorithm-design phase,
it need not be efficient; we could devote an arbitrary amount of computational
resource to doing it once. But in practice it is very useful to have an efficient method,
and that of [2] works in near-linear time. Here we measure run time as a function
of the number of different cases A considered (cases i in the general version). In
practice, we optimize the policy σ(A) for a set of cases A covering most of the
probability space, and take some default policy choice such as σ(A) = 1 on the rest.
Furthermore, the method of solving the knapsack problem evinces structural
aspects of the optimum, and these are interesting in their own right; for example,
at each point in (p2, p3)-space, whether to set σ(A) = 0 or 1 depends on the test
(A_2^+ - A_2^-) ≶ λ(A_3^+ - A_3^-), for some “constant” λ(p2, p3).

2.8 Optimal Policies 

While we have argued intuitively for our method of optimizing policies by
maximizing slopes, how do we know that such policies are indeed optimal? In
particular, how can we compare them with policies that may depend on additional
information (n, m2, and m3, rather than just p2 and p3), and that may not
be well-behaved, and therefore not in any way susceptible to analysis via the
differential equation method?

The answer is that, for an arbitrary policy, we can still compute the expected
changes E(Δn) et cetera over one round, and therefore E(Δp2)/E(Δp3) over a
round. By definition, our “optimized” policy maximized this ratio. Let us take
the optimal ratio and generously add a small value ε > 0 to it, to give ourselves
an unfair edge. Then any other policy gives a strictly smaller slope. We can
conclude that, over a large number of rounds, any competing policy produces a
sequence of values (p2, p3) almost surely lying strictly above the trajectory given
by our policy. That it can never cross back below again follows from real analysis,
taking advantage of Lipschitz continuity of our optimized policy; no assumptions
on the competing policy are needed. Again exploiting Lipschitz continuity, the
optimized policy is only some δ(ε) worse than the ε-boosted version, and any
competing policy is strictly worse than the boosted one, so it follows that the
competing policy is no better than the (unboosted) optimized policy.

Thus, our optimized policy truly is optimal. 

Optimizing the 1-variable algorithm presented at the beginning of this section
gives precisely the outcome illustrated in Figure 1. Starting from (p2, p3) =
(0, 3.22), we come to a point (1, 2/3 - ε) which is known to be feasible; thus,
a random 3-SAT formula with n variables and 3.22n or fewer clauses is almost
surely satisfiable. This result, in [2], already improved on the previous best
satisfiability bound.

2.9 Better Algorithms 

Following exactly the same methods as above, we can consider other myopic
algorithms. The first candidate is the algorithm considered by Achlioptas in [1],
where instead of choosing one literal A from a 2-clause, we choose both its
literals, say A and B.^ Achlioptas showed that whereas the best density that
had been proved satisfiable with a 1-variable algorithm was 3.003, with the
2-variable algorithm he could achieve a bound of 3.145.

^ If we were to choose two literals independently, from different clauses, then because
w.h.p. each has a finite “sphere of influence”, there is an O(1/n) chance that they
interact, and this is essentially the same as the 1-variable algorithm.
The 2-variable algorithm can be optimized by the same procedures; that the
policy σ now depends on the counts of both literals, A and B, does not
change anything. Interestingly, the optimal policy for the 2-variable algorithm
achieves a bound of 3.19: better (of course) than Achlioptas’ bound of 3.145,
but not as good as the bound of 3.22 achieved with the optimized 1-variable
algorithm. There is no obvious explanation for this, except that while selecting
two variables at once can indeed help to reduce the number of clauses faster, it
also certainly reduces the number of variables faster (by 2 instead of 1 in the
first instance), and so it is plausible that the densities (clauses per variable) it
achieves are less good. Another aspect may be that when one variable is chosen,
there are two ways to set it; when two variables are chosen from the same clause,
there are only 3 possible settings.

An algorithm that is clearly at least as good as both the 1- and 2-variable 
algorithms is a mixed algorithm. It randomly chooses a single literal A from a 
2-clause and decides, as a function of the counts A, whether to set A = 1, set 
A = 0, or look at A’s companion literal B, observe its counts B, and then decide 
how to set both A and B. It is at least as good as the 1-variable algorithm because 
it has the option of never looking at B, and at least as good as the 2-variable 
algorithm because it has the option of always looking at B. Optimizing the policy 
for this algorithm requires a small extension of the “knapsack” method, but can 
still be done in essentially the same way. It results in a satisfiability bound of 
3.26, the best known to date. 



2.10 Further? 

The method for optimizing a policy for a myopic 3-SAT algorithm does not
generalize obviously to k-SAT for any k > 3. The 3-SAT result relied on the
fact that a random 3-SAT formula is completely parametrized by n, m1, m2,
and m3; that between rounds of a myopic algorithm m1 = 0; that the remaining
parameters n, m2, and m3 can be sufficiently represented by just the densities
p2 = m2/n and p3 = m3/n; and that optimization is connected with maximizing
a single parameter, namely the slope Δp2/Δp3. Even for 4-SAT, while it is still
true that a random formula can adequately be parameterized by its densities,
now there are 3 densities (p2, p3, and p4), and from each point in density space,
there is a 2-parameter family of directions. It is not clear how to choose the best
direction.

Even for 3-SAT, as remarked in the introduction, [6] provides theoretical 
evidence that in the unsatisfiable regime it is difficult to prove the unsatisfiability 
of a given formula. On the other hand, the only way that we know the satisfiable 
regime is through algorithms that succeed w.h.p. The best is the mixed algorithm 
described above, showing that formulas F ∈ F(n, 3.26n) are satisfiable w.h.p.;
like all myopic algorithms, it runs in essentially linear time. So while in the 
unsatisfiable regime we know that there are many formulas whose unsatisfiability 
is hard to prove (by any known means), in all we know of the satisfiable regime, 
almost all formulas can be satisfied by a linear-time algorithm. 
This suggests dual questions to which we have no answers: (1) Is it truly
hard to prove that a random formula in the unsatisfiable regime is unsatisfiable? 
(2) Is it true that a random formula anywhere in the satisfiable regime (not just 
the part of the regime that we know today) can be satisfied in polynomial time, 
or even linear time? As the mixed algorithm, which looks at a broader “sphere of 
influence” than the 1-variable algorithm, increases the satisfiability bound from 
3.22 to 3.26, can further extending the sphere of influence give algorithms which 
work right up to the satisfiability threshold, or are more sophisticated methods 
required? 

3 Relaxing the Uniform Randomness Assumption 

Average-case analysis of algorithms is notoriously tricky, because the meaning 
is not clear: real-world instances are not uniformly random. However, they are 
probably not worst-case, either. 

A small step towards generalizing beyond uniform random formulas is taken 
by Cooper, Frieze, and Sorkin [8]. Here, rather than specifying only the number 
of clauses, the degree (number of appearances) of each literal is given. 

This result can be seen as filling in an obvious gap. The phase transition in
2-SAT at p2 = 1 is analogous to the birth of a giant component in a random
graph as the density of edges per vertex passes m/n = 1/2; both results can be
seen as the point when the average fanout of a branching process exceeds 1. (In
the graph model, the branching process goes from a vertex to its neighbors, and
so on; in the 2-SAT model, from a setting of a literal to its implications and so
on.)

Molloy and Reed [15] generalized the standard G(n, m) random graph model 
to random graphs with a specified degree sequence, asking what property of 
the degree sequence controlled whether the graph would, w.h.p., have a giant 
component. Again it is natural to ask the analogous question for 2-SAT, and [8] 
resolved that the answer too is analogous. The main result is as follows. 

A degree sequence d = d_1, d̄_1, ..., d_n, d̄_n is proper if

- every d_i, d̄_i ≤ n^α, where α < 1/13 is a constant;

- Σ_{i=1}^n (d_i + d̄_i) is even.



Theorem 2. Let d be proper and let 0 < ε < 1 be constant. Then if F is chosen
uniformly at random from all formulas having d as their degree sequence, then
with D1 = Σ_{i=1}^n (d_i + d̄_i) and D2 = Σ_{i=1}^n d_i d̄_i,

(a) if 2 D2 < (1 - ε) D1 then
    Pr(F is satisfiable) → 1, and

(b) if 2 D2 > (1 + ε) D1 then
    Pr(F is unsatisfiable) → 1.
In the case of m = cn randomly chosen clauses, D1 = 2cn and w.h.p. D2 ~ c^2 n,
giving the familiar 2-SAT phase transition of [7].
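
The criterion of Theorem 2 is easy to evaluate for a concrete degree sequence. The sketch below assumes D1 is the sum of d_i + d̄_i and D2 the sum of d_i · d̄_i, consistent with the check D1 = 2cn, D2 ~ c^2 n above; it does not verify properness, and since the theorem is asymptotic the labels for a finite sequence are only indicative. The function name is hypothetical.

```python
def two_sat_regime(degrees, eps=0.1):
    """Classify a degree sequence [(d_i, dbar_i), ...] by the test 2*D2 vs D1."""
    d1 = sum(d + dbar for d, dbar in degrees)
    d2 = sum(d * dbar for d, dbar in degrees)
    if 2 * d2 < (1 - eps) * d1:
        return "satisfiable w.h.p."
    if 2 * d2 > (1 + eps) * d1:
        return "unsatisfiable w.h.p."
    return "undetermined"

if __name__ == "__main__":
    print(two_sat_regime([(1, 0)] * 10))  # no variable occurs with both signs
    print(two_sat_regime([(2, 2)] * 10))  # every literal occurs twice
    print(two_sat_regime([(1, 1)] * 10))  # exactly at the boundary 2*D2 = D1
```

In the uniform case with every d_i = d̄_i = c the test reduces to comparing c with 1, the usual 2-SAT threshold.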

The “satisfiable” half of the theorem is relatively easy to prove. The subcrit- 
icality of the branching process means that there are unlikely to be long paths 
(chains of implications), and there are likely to be no occurrences of a related 
structure, a “bicycle”, shown by Chvatal and Reed [7] to be a necessary condition 
for unsatisfiability. 

The “unsatisfiable” half is a bit harder. It is argued that a complementary
pair of literals A, Ā can be found such that both span(A) and span(Ā) (the sets
of implications of setting each True) are “large”. Further, w.h.p., large spans
contain complementary literals; that is, w.h.p. A → X and A → X̄, while
Ā → Y and Ā → Ȳ. If this is so, the formula is unsatisfiable.

4 Conclusions 

A certain amount is known about random k-SAT formulas. Sparse formulas
are almost all satisfiable, dense ones are almost all unsatisfiable, and there is a 
sharp threshold somewhere in between. The case of 2-SAT is understood well: 
it is known exactly where the threshold lies and how sharp it is. The 2-SAT 
threshold point can be generalized to random formulas with a specified degree 
sequence. 

But much less is known for k > 2. Here it is not even known if the threshold
density has a limit as n → ∞, although it is expected that it does, for all k,
and that for 3-SAT the limit is about 4.2. Although it is likely that both the
known 3-SAT satisfiability bound of 3.26 and unsatisfiability bound of 4.596 can
be improved, different methods will be needed to place either of them at the true
threshold value. For k > 3, the methods of [2] no longer suffice to tune myopic
satisfiability algorithms.

Nothing is known about relaxing the uniform randomness assumption for 
random 3-SAT formulas, and since in a 3-clause no variable’s setting has any 
direct implications, generalizing the result of [8] is not straightforward. 

A broader question is whether, as [6] suggests, random unsatisfiable formulas
are (on average) truly hard to prove to be unsatisfiable, and, on the other side,
whether stronger myopic algorithms suffice to solve random formulas in the
satisfiable regime, whether other polynomial-time algorithms are required, or
whether denser formulas within the satisfiable regime are hard.
Abstract. We present new ideas on how to make simulated annealing
applicable to the combinatorial optimization problem of accumulating 
a Jacobian matrix of a given vector function using the minimal number 
of arithmetic operations. Building on vertex elimination in computa- 
tional graphs we describe how simulated annealing can be used to find 
good approximations to the solution of this problem at a reasonable cost. 

Keywords: Accumulation of Jacobian matrices, vertex elimination, 
simulated annealing, logarithmic cooling schedule, inhomogeneous 
Markov chains. 



1 Problem Description 

The requirement to compute accurate derivative information for mathematical 
models efficiently is central to many scientific, economical, and engineering prob- 
lems. Without it the highly desirable step from simulation to optimization can 
often not be made. Especially first order derivatives of vector functions given as 
computer programs play an important role in modern numerical analysis. 

Let F' = F'(x_0) = (∂y_i/∂x_j (x_0)) denote the Jacobian matrix of a nonlinear
vector function F : ℝ^n ⊇ D → ℝ^m : x ↦ y = F(x) evaluated at
some argument x_0. Automatic Differentiation (AD) [7], [3], [10], [6] enables us
to compute F' numerically with machine accuracy. F is assumed to be given
as a computer program which decomposes into a sequence of scalar elemental
functions (ℝ ∋) v_j = φ_j(v_i)_{i≺j} where j = 1, ..., q and p = q - m. i ≺ j denotes
the direct dependence of v_j on v_i. We write i ≺* j if there exist k_1, ..., k_p such
that i ≺ k_1 ≺ k_2 ≺ ... ≺ k_p ≺ j. So, {i | i ≺ j} is the index set of arguments of φ_j
and we denote its cardinality by |{i | i ≺ j}|. Within F we distinguish between
three types of variables V = X ∪ Z ∪ Y, independent (X = {v_{1-n}, ..., v_0}),
intermediate (Z = {v_1, ..., v_p}), and dependent (Y = {v_{p+1}, ..., v_q}). We
set x_i = v_{i-n}, i = 1, ..., n, and y_j = v_{p+j}, j = 1, ..., m. The numbering
T : V → {(1 - n), ..., q} of the variables of F is expected to induce a topological
order with respect to the dependence ≺, i.e. i ≺ j ⇒ T(v_i) < T(v_j).

Since the differentiation of F is based on the differentiability of its elemental
functions it will be assumed that the φ_j, j = 1, ..., q, have jointly continuous
partial derivatives c_{ji} = ∂φ_j/∂v_i (v_k)_{k≺j}, i ≺ j, on open neighborhoods D_j ⊂

K. Steinhofel (Ed.): SAGA 2001, LNCS 2264, pp. 131-144, 2001. 
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Fig. 1. G and G' 



rij = \{i\i < j}\, of their domain. The computational graph (or c-graph) G = 
(V,E) of F is a directed acyclic graph with V = {i\vi G F} and (i,j) G F if 
i -< j. We assume G to be linearized in the sense that the numerical values of all 
partial derivatives of the elemental functions are attached to their corresponding 
edges, i.e. (i,j) is labelled with Cji. Fig. 1 shows an example of a c-graph with 
n = 2, p = 3, and m = 2. The local partial derivatives Cj^i, i -< j, are denoted by 
a, . . . , h. Assuming, for example, that V\ = u_i • Vq and = sin(w 3 ) we get 



a = 



d{v-i ■ Up) 



Vo 



and 



^ 9(sin(?;3)) 
9v3 



cos(?;3). 



The accumulation of $F'$ can be regarded as the process of transforming $G$ into a subgraph $G'$ of the complete bipartite graph $K_{n,m}$ [11]. This transformation will be denoted by $G \to G'$. Different application sequences of the chain rule to $G$ may yield drastically varying computational costs.

By the chain rule an entry $F'(i,j)$ of the Jacobian can be computed by multiplying the edge labels over all paths connecting the minimal vertex $j$ with the maximal vertex $i$ followed by building the sum over all these products [16]. For example, the Jacobian of the c-graph in Fig. 1 is given by

$$F' = \begin{pmatrix} ef + acdf & bcdf \\ eg + acdg & h + bcdg \end{pmatrix},$$

which is equivalent to $G' = K_{2,2}$ as shown in Fig. 1 on the right. Assuming that modern floating-point units can compute an addition on top of a multiplication at virtually no extra cost [14] we will take the number of multiplications as the objective function to be minimized. The resulting combinatorial optimization problem is called the Optimal Jacobian Accumulation (OJA) problem [22] and it is conjectured to be NP-complete [11], [5]. This conjecture is based on similarities between OJA and the Gaussian elimination problem considered in [27]. Herley [13] showed that the closely related Fill-in problem is NP-complete [9] under vertex elimination. Proofs of corresponding results for edge and face elimination [24] have not been given so far.
Fig. 2. Elimination of vertex 1



The successive elimination of all (vertex-)paths of length $\geq 2$ in the above example would take 14 multiplications. An optimal application sequence of the chain rule would give us $F'$ at the cost of only 7 multiplications. Simply pre-accumulate $r = cd$, $s = ar + e$, and $t = br$ and compute

$$F' = \begin{pmatrix} sf & tf \\ sg & tg + h \end{pmatrix}.$$

In fact, this optimal way of computing $F'$ is equivalent to the vertex elimination sequence (2,1,3) in $G$. Vertex elimination will be introduced in the following section.
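The two multiplication counts quoted above can be checked with a short Python sketch. The numeric edge labels and the `Counter` helper are illustrative assumptions, not part of the original text; only the two formulas for $F'$ are taken from it.

```python
# Sample values for the edge labels a..h of the c-graph in Fig. 1 (assumed).
a, b, c, d, e, f, g, h = 2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0


class Counter:
    """Multiplies its arguments and counts the scalar multiplications."""

    def __init__(self):
        self.count = 0

    def mul(self, *factors):
        result = factors[0]
        for x in factors[1:]:
            result *= x
            self.count += 1
        return result


def jacobian_by_paths(m):
    # Multiply the edge labels along every path, then sum the products.
    return [
        [m.mul(e, f) + m.mul(a, c, d, f), m.mul(b, c, d, f)],
        [m.mul(e, g) + m.mul(a, c, d, g), h + m.mul(b, c, d, g)],
    ]


def jacobian_preaccumulated(m):
    # Pre-accumulate r, s, t as described in the text.
    r = m.mul(c, d)
    s = m.mul(a, r) + e
    t = m.mul(b, r)
    return [[m.mul(s, f), m.mul(t, f)], [m.mul(s, g), m.mul(t, g) + h]]


naive, smart = Counter(), Counter()
J_paths = jacobian_by_paths(naive)
J_pre = jacobian_preaccumulated(smart)
print(naive.count, smart.count)
```

Both variants yield the same matrix; only the number of multiplications differs.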

2 Simulated Annealing 

2.1 Configuration 

Our discussion will be based on vertex elimination which was first used in [11] 
to optimize the accumulation of the Jacobian by applying the Markowitz rule 
adapted from the theory on the solution of sparse linear systems. 

Rule 1 To eliminate a vertex $j$ from $G$ we perform for all $i \prec j$ and all $k$ with $j \prec k$ the following steps:

1. Compute $d = c_{k,j} \cdot c_{j,i}$.

2. If $(i,k) \in E$ then

   set $c_{k,i} = c_{k,i} + d$

   else

   - introduce a new edge $(i,k) \in E$;

   - label it with $d$.
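Rule 1 can be sketched in Python as follows. This is an illustration under stated assumptions: the c-graph is stored as a dictionary `{(i, k): label}`, the edge layout of Fig. 1 is inferred from the Jacobian given in Sec. 1, the label values are arbitrary, and removing the eliminated vertex together with its incident edges is assumed to be part of the elimination.

```python
def eliminate(edges, j):
    """Apply Rule 1 to vertex j; return the number of multiplications."""
    preds = [i for (i, k) in edges if k == j]
    succs = [k for (i, k) in edges if i == j]
    mults = 0
    for i in preds:
        for k in succs:
            dlabel = edges[(j, k)] * edges[(i, j)]   # step 1
            mults += 1
            # step 2: absorb into an existing edge or create fill-in
            edges[(i, k)] = edges.get((i, k), 0.0) + dlabel
    for i in preds:          # assumed: j and its edges are removed
        del edges[(i, j)]
    for k in succs:
        del edges[(j, k)]
    return mults


def fig1_graph():
    # Edge layout inferred from the Jacobian in Sec. 1 (assumption).
    a, b, c, d, e, f, g, h = 2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0
    return {(-1, 1): a, (0, 1): b, (1, 2): c, (2, 3): d,
            (-1, 3): e, (3, 4): f, (3, 5): g, (0, 5): h}


def accumulate(sequence):
    """Eliminate the vertices in the given order; return (cost, Jacobian edges)."""
    edges = fig1_graph()
    cost = sum(eliminate(edges, j) for j in sequence)
    return cost, edges


cost, jac = accumulate((2, 1, 3))
print(cost, jac)
```

After all intermediate vertices are eliminated, the remaining edges connect independent and dependent vertices only, and their labels are the Jacobian entries.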
A graphical illustration of Rule 1 can be found in Fig. 2 which shows $G$ after the elimination of vertex 1. Assuming that $\forall i,j \in Y : i \not\prec j \wedge j \not\prec i$ and $\forall i,j \in X : i \not\prec j \wedge j \not\prec i$ we get the following

Proposition 1 The transformation $G \to G'$ can be regarded as the elimination of the $p$ intermediate vertices in $G$ using Rule 1.

Proof: The proof of Prop. 1 follows immediately from the chain rule and it can be found in [11]. □

Our configuration space is the set of all feasible vertex elimination sequences in $G$. It will be denoted by $\mathcal{V}$ and it is finite with $|\mathcal{V}| = p!$. A vertex elimination sequence $\sigma$ is a permutation of the indices of the intermediate vertices in $G$, i.e.

$$\sigma = (\pi_1, \dots, \pi_p), \text{ where } \forall i,j,\ i \neq j : 1 \leq \pi_i \leq p,\ 1 \leq \pi_j \leq p,\ \pi_i \neq \pi_j.$$

According to Rule 1 the elimination of a vertex $j \in Z$ involves $|P_j| \cdot |S_j|$ scalar multiplications. $|P_j|$ and $|S_j|$ denote the numbers of predecessors and successors of $j$, respectively. Both values depend on the order within the current elimination sequence $\sigma$. We aim to minimize the objective function

$$C = C(\sigma) = \sum_{i=1}^{p} |P_{\pi_i}| \cdot |S_{\pi_i}|$$

over $\mathcal{V}$. Notice that $C \in \mathbb{N}$ for all $\sigma \in \mathcal{V}$. The associated combinatorial optimization problem will be referred to as the Vertex Elimination (VE) problem. The set of optimal vertex elimination sequences is defined by

$$\mathcal{V}_{\min} = \{\sigma \mid \forall \sigma' \in \mathcal{V} : C(\sigma) \leq C(\sigma')\}.$$
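For very small graphs the set of optimal sequences can be found by enumerating all $p!$ permutations. The following sketch does this for the example of Fig. 1; the edge list is inferred from the Jacobian in Sec. 1, and the function name `elimination_cost` is a hypothetical helper that tracks only the graph structure, not the edge labels.

```python
from itertools import permutations


def elimination_cost(edges, sequence):
    """C(sigma): sum of |P_j| * |S_j| over the vertices eliminated in order."""
    preds, succs = {}, {}
    for i, k in edges:
        succs.setdefault(i, set()).add(k)
        preds.setdefault(k, set()).add(i)
    total = 0
    for j in sequence:
        p, s = preds.pop(j, set()), succs.pop(j, set())
        total += len(p) * len(s)
        for i in p:                      # connect predecessors to successors
            succs[i] = (succs[i] - {j}) | s
        for k in s:
            preds[k] = (preds[k] - {j}) | p
    return total


# Structure of the c-graph in Fig. 1 (assumed layout).
FIG1_EDGES = [(-1, 1), (0, 1), (1, 2), (2, 3), (-1, 3), (3, 4), (3, 5), (0, 5)]

costs = {seq: elimination_cost(FIG1_EDGES, seq)
         for seq in permutations((1, 2, 3))}
best = min(costs.values())
v_min = sorted(seq for seq, cst in costs.items() if cst == best)
print(costs, v_min)
```

Exhaustive search is only feasible for tiny $p$, which is the motivation for the heuristic search discussed in the remainder of this section.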



Proposition 2 The complexity of evaluating $C$ over $\mathcal{V}$ is $O(p^3)$.

Proof: Without loss of generality, let $m = n$ and $k = 2 \cdot n + p$ with $p$ odd. Considering the most expensive vertex elimination sequence possible we observe that the overall cost adds up to

$$\left(\frac{k-1}{2}\right)^2 + \frac{k-1}{2}\left(\frac{k-1}{2} - 1\right) + \left(\frac{k-1}{2} - 1\right)^2 + \dots + n^2 = O(p^3)$$

for a given $n$. □ Obviously, the evaluation of $C$ is polynomial in $n$ and $m$ too.

VE is not equivalent to OJA [19]. There are cases where the optimal vertex 
elimination sequence does not minimize the number of scalar multiplications 
required to compute F' . However, building on a large number of numerical tests 
[19], we conjecture that this vertex discrepancy does not exceed a small constant 
factor. This makes vertex elimination a sensible approach to approximating the 
solution of OJA. Refer to Sec. 4 for further remarks. 
2.2 Neighborhood Relations



The effect of simulated annealing on the solution of VE has been investigated 
first in [23] by combining rearrangements suggested in [15] with various cool- 
ing schedules. The corresponding algorithms were applied to a selection of 
widely used test problems in numerical simulation and optimization showing 
very promising results. In this paper we will propose new ideas on how to make 
logarithmic cooling schedules for inhomogeneous Markov chains [12] work on 
vertex elimination sequences in c-graphs. Furthermore, the results of this ap- 
proach will be compared with standard pre-accumulation techniques [11] as well 
as with a simple fixed-temperature Metropolis [17] algorithm for an a priori fixed 
stopping time. 

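Neighborhood relations on $\mathcal{V}$ are defined by rearrangements transforming $\sigma \in \mathcal{V}$ into $\sigma' \in \mathcal{N}_\sigma \subseteq \mathcal{V}$ where $\mathcal{N}_\sigma$ denotes the neighborhood of $\sigma$. These transitions will be denoted by $[\sigma, \sigma']$ and the probability of generating $\sigma'$ from $\sigma$ via a corresponding sequence of transitions by $G[\sigma, \sigma']$. For simplicity, transitions will be defined such that

$$G[\sigma, \sigma'] = \frac{1}{|\mathcal{N}_\sigma|} \quad \text{for } \sigma' \in \mathcal{N}_\sigma. \qquad (1)$$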

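For this purpose we will denote a uniformly distributed random number over the closed interval $[1,p] \subset \mathbb{N}$ by $r \in \mathcal{R}([1,p])$. Acceptance probabilities are associated with all feasible transitions $[\sigma, \sigma']$. They are defined by

$$A[\sigma, \sigma'] = \begin{cases} 1, & \text{if } C(\sigma') \leq C(\sigma), \\ e^{-\frac{C(\sigma') - C(\sigma)}{t}}, & \text{otherwise,} \end{cases} \qquad (2)$$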

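where the control parameter $t$ can be interpreted as the temperature in the annealing process [17], [29]. The probability of a transition $[\sigma, \sigma']$ actually being executed is

$$P[\sigma, \sigma'] = \begin{cases} G[\sigma, \sigma'] \cdot A[\sigma, \sigma'], & \text{if } \sigma' \neq \sigma, \\ 1 - \sum_{\sigma'' \neq \sigma} G[\sigma, \sigma''] \cdot A[\sigma, \sigma''], & \text{otherwise.} \end{cases} \qquad (3)$$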

Building on (3) one can define the probability of being in a configuration $\sigma$ after $k$ steps:

$$P_\sigma(k) = \sum_{\sigma'} P_{\sigma'}(k-1) \cdot P[\sigma', \sigma]. \qquad (4)$$

We will consider inhomogeneous Markov chains defined by (4) where the temperature $t$ is lowered in (2) at each single step. First of all we need to define suitable neighborhood relations as, for example, the following three:

Neighborhood 1 (N1) The transition $[\sigma, \sigma']$ is defined by $r, r' \in \mathcal{R}([1,p])$, $r < r'$, such that

$$\sigma = (\pi_1, \dots, \pi_{r-1}, \pi_r, \pi_{r+1}, \dots, \pi_{r'-1}, \pi_{r'}, \pi_{r'+1}, \dots, \pi_p),$$

$$\sigma' = (\pi_1, \dots, \pi_{r-1}, \pi_{r'}, \pi_{r'-1}, \dots, \pi_{r+1}, \pi_r, \pi_{r'+1}, \dots, \pi_p).$$

$\sigma'$ is derived from $\sigma$ by reversing the subsequence $(\pi_r, \dots, \pi_{r'})$.

Neighborhood 2 (N2) The transition $[\sigma, \sigma']$ is defined by $r, r' \in \mathcal{R}([1,p])$, $r < r'$, such that

$$\sigma = (\pi_1, \dots, \pi_{r-1}, \pi_r, \pi_{r+1}, \dots, \pi_{r'-1}, \pi_{r'}, \pi_{r'+1}, \dots, \pi_p),$$

$$\sigma' = (\pi_1, \dots, \pi_{r-1}, \pi_{r'}, \pi_{r+1}, \dots, \pi_{r'-1}, \pi_r, \pi_{r'+1}, \dots, \pi_p).$$

The two elements $\pi_r$ and $\pi_{r'}$ are exchanged in $\sigma$ to get $\sigma'$.

Neighborhood 3 (N3) The transition $[\sigma, \sigma']$ is defined by $r \in \mathcal{R}([1,p])$ as

$$\sigma = (\pi_1, \dots, \pi_r, \dots, \pi_p),$$

$$\sigma' = \begin{cases} (\pi_1, \dots, \pi_{r+1}, \pi_r, \pi_{r+2}, \dots, \pi_p), & \text{if } r < p, \\ (\pi_p, \pi_2, \dots, \pi_{p-1}, \pi_1), & \text{if } r = p. \end{cases}$$

Two neighboring vertices are exchanged in $\sigma$ to give $\sigma'$. The neighborhood is made cyclic by exchanging $\pi_1$ and $\pi_p$ if $r = p$. If $\sigma'$ can be obtained from $\sigma$ by some sequence of given transitions then we write $\sigma' \in \mathcal{N}^*_\sigma$.
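The three rearrangements can be written compactly in Python. This is a sketch with 1-based positions $r < r'$ as in the definitions; the function names and the way the random pair is drawn in `random_pair` are assumptions made for illustration.

```python
import random


def n1_reverse(sigma, r, r2):
    """N1: reverse the subsequence pi_r, ..., pi_r' (1-based, r < r')."""
    s = list(sigma)
    s[r - 1:r2] = reversed(s[r - 1:r2])
    return tuple(s)


def n2_swap(sigma, r, r2):
    """N2: exchange the two elements pi_r and pi_r'."""
    s = list(sigma)
    s[r - 1], s[r2 - 1] = s[r2 - 1], s[r - 1]
    return tuple(s)


def n3_adjacent(sigma, r):
    """N3: exchange pi_r and pi_(r+1); for r = p exchange pi_1 and pi_p."""
    p = len(sigma)
    return n2_swap(sigma, r, r + 1) if r < p else n2_swap(sigma, 1, p)


def random_pair(p, rng):
    """Draw two distinct positions from [1, p] and return them sorted."""
    r, r2 = sorted(rng.sample(range(1, p + 1), 2))
    return r, r2


sigma = (1, 2, 3, 4, 5)
print(n1_reverse(sigma, 2, 5))   # (1, 5, 4, 3, 2)
print(n2_swap(sigma, 2, 5))      # (1, 5, 3, 4, 2)
print(n3_adjacent(sigma, 5))     # (5, 2, 3, 4, 1)
```

Applying the same move with the same positions twice restores the original sequence, which is the reversibility property used below.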

In order to evaluate the computational complexity of each of the moves one should notice that the structure of the c-graph is modified by vertex elimination. This implies that either any new elimination sequence has to be run “from scratch” to evaluate its cost or all intermediate versions of the graph generated by the successive elimination of vertices have to be stored. For N2 the latter approach would enable us to start the evaluation of $\sigma'$ with the graph resulting from the elimination of $(\pi_1, \dots, \pi_{r-1})$ from $G$ at the cost of a significant increase in memory requirement. We have decided to pursue the first strategy resulting in a computational complexity regarding the number of scalar floating-point multiplications performed which is quadratic in the number $|V|$ of vertices in $G$. In fact, the worst case is the complete graph $K_{|V|}$.

We will show that N2 satisfies all requirements for logarithmic simulated annealing as described in [12]. It is straightforward to show similar results for N1 and N3. First of all N2 is required to be reversible.

Proposition 3

$$\forall \sigma, \sigma' \in \mathcal{V} : \sigma' \in \mathcal{N}_\sigma \Leftrightarrow \sigma \in \mathcal{N}_{\sigma'}.$$

Proof: This result is obvious and follows immediately from the definition of N2. □

Knowing that our transitions are reversible we need to show that there exist sequences of transitions transforming $\sigma$ into $\sigma'$ for all pairs of configurations $\sigma, \sigma' \in \mathcal{V}$.

Proposition 4

$$\forall \sigma, \sigma' \in \mathcal{V}\ \exists \sigma_0, \dots, \sigma_k \in \mathcal{V}\ (\sigma_0 = \sigma \wedge \sigma_k = \sigma') : G[\sigma_i, \sigma_{i+1}] > 0,\ i = 0, \dots, (k-1).$$

Proof: By (1) $G[\sigma_i, \sigma_j] > 0$ for all $\sigma_i, \sigma_j \in \mathcal{V}$ with $\sigma_j \in \mathcal{N}_{\sigma_i}$. Furthermore, by exchanging two vertices at a time any permutation of the $p$ intermediate vertices can be obtained, i.e. $\forall \sigma \in \mathcal{V} : \mathcal{N}^*_\sigma = \mathcal{V}$. □

An immediate consequence of Prop. 3 and Prop. 4 is that N2 is identically reversible, i.e. if $\sigma_0, \dots, \sigma_k \in \mathcal{V}$ are any elimination sequences as in Prop. 4 to obtain $\sigma'$ from $\sigma$ then $\sigma$ can be obtained from $\sigma'$ via $\sigma' = \sigma_k, \dots, \sigma_0 = \sigma$.

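Definition 1. $\sigma' \in \mathcal{V}$ is reachable at height $h$ from $\sigma \in \mathcal{V}$ if $\exists \sigma_0, \dots, \sigma_k \in \mathcal{V}$ ($\sigma_0 = \sigma \wedge \sigma_k = \sigma'$) such that $G[\sigma_i, \sigma_{i+1}] > 0$, $i = 0, \dots, (k-1)$, and $C(\sigma_i) \leq h$ for all $i = 0, \dots, k$.

The height of a sequence of transitions transforming $\sigma$ into $\sigma'$ will be denoted by $\mathcal{H}[\sigma, \sigma']$.

Proposition 5

$$\forall h : \mathcal{H}[\sigma, \sigma'] \leq h \Leftrightarrow \mathcal{H}[\sigma', \sigma] \leq h.$$

Proof: With $G[\sigma_i, \sigma_j] > 0$ for all $\sigma_i, \sigma_j \in \mathcal{V}$ this is a direct consequence of N2 being identically reversible. □

Definition 2. Let $\sigma_{\min}$ denote a local minimum, i.e. $\sigma_{\min} \in \mathcal{V} \setminus \mathcal{V}_{\min}$ and $C(\sigma_{\min}) < C(\sigma)$ for all $\sigma \in \mathcal{N}_{\sigma_{\min}} \setminus \sigma_{\min}$. The escape depth $\mathcal{D}(\sigma_{\min})$ of $\sigma_{\min}$ is the smallest $\Delta h$ such that there exists a $\sigma' \in \mathcal{V}$ where $C(\sigma') < C(\sigma_{\min})$ which is reachable at height $C(\sigma_{\min}) + \Delta h$.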

According to Hajek’s result [12] asymptotic convergence of the logarithmic simulated annealing algorithm is guaranteed if the temperature in (2) is chosen as

$$t = t(k) = \frac{\Gamma}{\ln(k+1)}$$

with $\Gamma \geq \max_{\sigma_{\min}} \mathcal{D}(\sigma_{\min})$. The detailed investigation of the energy landscape of VE is subject to ongoing research (see Sec. 4). The main outcome of this should be a better understanding of how to choose $\Gamma$ appropriately. In the following section we will present some preliminary test results obtained by implementing various cooling schedules based on heuristic choices for $\Gamma$.
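The pieces of this section can be combined into a small annealing loop. The sketch below is illustrative only: it assumes a logarithmic schedule of the form $t(k) = \Gamma / \ln(k+1)$ with $k$ starting at 1 (the offset inside the logarithm is an assumption), uses the exchange move N2, evaluates every candidate "from scratch", and runs on the small graph of Fig. 1 with an edge layout inferred from Sec. 1. All function names are hypothetical.

```python
import math
import random


def elimination_cost(edges, sequence):
    """C(sigma): sum of |P_j| * |S_j| over the vertices eliminated in order."""
    preds, succs = {}, {}
    for i, k in edges:
        succs.setdefault(i, set()).add(k)
        preds.setdefault(k, set()).add(i)
    total = 0
    for j in sequence:
        p, s = preds.pop(j, set()), succs.pop(j, set())
        total += len(p) * len(s)
        for i in p:
            succs[i] = (succs[i] - {j}) | s
        for k in s:
            preds[k] = (preds[k] - {j}) | p
    return total


def log_temperature(k, gamma):
    """Assumed logarithmic schedule t(k) = gamma / ln(k + 1), k >= 1."""
    return gamma / math.log(k + 1)


def accept(delta, t, rng):
    """Acceptance rule (2): improvements always, else with exp(-delta / t)."""
    return delta <= 0 or rng.random() < math.exp(-delta / t)


def anneal(cost, sigma, gamma, steps, rng):
    current, c_cur = tuple(sigma), cost(sigma)
    best, c_best = current, c_cur
    for k in range(1, steps + 1):
        r, r2 = rng.sample(range(len(current)), 2)   # N2: pick two positions
        cand = list(current)
        cand[r], cand[r2] = cand[r2], cand[r]
        cand = tuple(cand)
        c_cand = cost(cand)
        if accept(c_cand - c_cur, log_temperature(k, gamma), rng):
            current, c_cur = cand, c_cand
            if c_cur < c_best:
                best, c_best = current, c_cur
    return best, c_best


EDGES = [(-1, 1), (0, 1), (1, 2), (2, 3), (-1, 3), (3, 4), (3, 5), (0, 5)]


def cost(sigma):
    return elimination_cost(EDGES, sigma)


start = (3, 2, 1)   # reverse order, cost 10 on this graph
best, c_best = anneal(cost, start, gamma=2.0, steps=200, rng=random.Random(0))
print(best, c_best)
```

The loop keeps track of the best sequence seen so far, since the chain itself may leave a good configuration again at non-zero temperature.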

3 Numerical Results 

The first test function was introduced in [19] and referred to as absorption function. It is given as $y = f(x)$ with $f : \mathbb{R}^n \to \mathbb{R}$ defined by

$$y = \prod_{i=0}^{n-1} \left( \varphi_i(x_i) \cdot \prod_{j=1}^{e_i} x_i \right) \qquad (5)$$

where $\varphi_i : \mathbb{R} \to \mathbb{R}$ and $e_i \geq 1$ for $i = 0, \dots, (n-1)$. Fig. 3 shows an instance of (5). Our discussion will be based on c-graphs $G$ of this shape with variable parameters $n$ and $e_i$, $i = 0, \dots, (n-1)$. We will consider two special cases defined by $e_i = i + 1$, $i = 0, \dots, 9$, (ABS1) and $e_i = 3$, $i = 0, \dots, 29$, (ABS2) respectively.

Fig. 3. Absorption Function

One interesting property of (5) with respect to VE is that its optimal solution is a well-defined combination of forward and reverse vertex elimination applied to certain parts of $G$. In fact, the products $w_i(e_i) = \varphi_i \cdot x_i \cdot \ldots \cdot x_i$, $i = 0, \dots, (n-1)$, have to be eliminated in forward mode followed by the elimination of the intermediate vertices of $w_1 \cdot w_2 \cdot \ldots \cdot w_{n-1}$ in reverse mode. This leaves us with $n$ vertices (representing the $w_i$, $i = 0, \dots, (n-1)$) to be eliminated at the cost of one multiplication each. Consequently, we can immediately provide expressions for the cost of running a forward vertex elimination sequence (VFM: $n + \sum_{i=0}^{n-1} e_i + \binom{n}{2} - 1$), a reverse vertex elimination sequence (VRM: $2 \cdot (n-2) + 2 \cdot \sum_{i=0}^{n-1} e_i + n$), and an optimal vertex elimination sequence (VOPT: $n + \sum_{i=0}^{n-1} e_i + 2 \cdot (n-2)$). For ABS1 we get the values 109, 136, and 81, respectively. ABS2 results in 554, 266, and 176.
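The three cost expressions can be evaluated directly to confirm the quoted values. The function names below are hypothetical; the formulas are the ones stated in the text.

```python
from math import comb


def vfm(e):
    """Cost of the forward vertex elimination sequence."""
    n = len(e)
    return n + sum(e) + comb(n, 2) - 1


def vrm(e):
    """Cost of the reverse vertex elimination sequence."""
    n = len(e)
    return 2 * (n - 2) + 2 * sum(e) + n


def vopt(e):
    """Cost of the optimal vertex elimination sequence."""
    n = len(e)
    return n + sum(e) + 2 * (n - 2)


abs1 = [i + 1 for i in range(10)]   # e_i = i + 1, i = 0, ..., 9
abs2 = [3] * 30                     # e_i = 3,     i = 0, ..., 29

for name, e in (("ABS1", abs1), ("ABS2", abs2)):
    print(name, vfm(e), vrm(e), vopt(e))
```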

Our simulated annealing algorithm starts with the worse of VFM and VRM. Of course, we could as well have chosen a random elimination sequence as starting point. In any case we do not want the cost of this initial sequence to be close to the minimum already. By picking the maximum of VFM and VRM we can satisfy this requirement in most cases. Additionally, min{VFM, VRM} often gives us a useful reference value for comparison with the result delivered by the annealing process. (See, for example, the annealing of the Chebychev Quadrature problem [1] in [23].)

We have combined each of the three neighborhood relations N1, N2, and N3 with a cooling schedule defined by $\Gamma = 2$ leading to simulated annealing algorithms SAV1, SAV2, and SAV3. Fig. 4 and Fig. 5 show the development of the objective function over the first one thousand iterations for ABS1 and ABS2, respectively. It took approximately five seconds on an Intel Pentium-III 233 MHz system to complete one annealing process. This runtime may be acceptable even when simulated annealing is thought of as a compiler optimization pass (see Sec. 4). No significant improvements were achieved by increasing the number of iterations to five or even ten thousand. While N1 and N2 seem to perform well in both cases, N3 results in a slower convergence, especially for ABS1. This suggests that it may either be a poor neighborhood choice or that the cooling was in fact too quick. In order to investigate this problem further we have applied a slower cooling schedule with

$$t = t(k) = \frac{\Gamma}{\ln(\sqrt{k} + 1)}, \quad \Gamma = 2, \quad k = 0, \dots, 5000,$$

to ABS1. After a considerably increased runtime of approximately 30 seconds the algorithm returned 108. This result does not represent a significant improvement. The values delivered by SAV1 and SAV2 were not reached. We conclude that simulated annealing based on N3 in combination with logarithmic cooling schedules for inhomogeneous Markov chains is less suitable for our purposes.

Fig. 4. Simulated annealing of ABS1

Fig. 5. Simulated annealing of ABS2

Research in the field of optimizing the accumulation of Jacobian matrices by elimination techniques in c-graphs [24] has only recently started to deliver both theoretical and numerical results. First attempts to use vertex elimination in connection with the greedy Markowitz heuristic have led to the first numerical results published in [11]. However, most results suitable for comparison with the simulated annealing approach were presented in [19]. Standard techniques of AD such as forward and reverse vector modes [10], seed matrix compression techniques [8], [25], or sparse forward and reverse modes [4] are all aimed towards the minimization of the computational effort for accumulating derivatives. However, in virtually all cases their performance cannot keep up with elimination techniques, provided that the latter are applicable (see Sec. 4). The following table shows how the simulated annealing algorithms compare with VFM, VRM, and with the known optimum. The cost of the start configuration and the minimum achieved by simulated annealing are highlighted.





          VFM    VRM   SAV1   SAV2   SAV3   VOPT
ABS1      109   136*     96    87*    117     81
ABS2     554*    266   194*    206    248    176

(* marks the cost of the start configuration and the minimum achieved by simulated annealing.)



The logarithmic cooling schedule was introduced because it is possible to prove its asymptotic convergence [12]. However, the schedule is widely regarded as impractical. Often comparable or even better results can be obtained by a simple fixed-temperature Metropolis algorithm. We have implemented the latter for various temperatures $t$ while fixing the number of iterations to be performed (5000 in this case). The algorithm was applied to ABS1 and the development of the objective function is illustrated in Fig. 6. For $t = 0.5$ the algorithm does not seem to converge. “By luck” the case $t = 0.4$ produces a good elimination sequence early in the annealing process. However, it soon diverges from this solution due to the acceptance of worse sequences as the algorithm proceeds. The variant where $t = 0.1$ appears to be too restrictive in the acceptance of new elimination sequences resulting in a slower convergence. Both $t = 0.2$ and $t = 0.3$ turn out to be good choices for the temperature. Again, the algorithm with $t = 0.2$ generates a very good sequence early in the annealing process which results in a slightly better final solution. These results correspond to the ones obtained using logarithmic cooling schedules. Fig. 7 shows the development of the temperature over the first one thousand iterations with $\Gamma = 2$. Most of the time the algorithm runs at a temperature between 0.35 and 0.29.
Fig. 6. Fixed-temperature Metropolis Algorithm for ABS1

Fig. 7. Development of Temperature for $\Gamma = 2$
In order to test the robustness of N2 in a less predictable environment we have generated four c-graphs at random with the following specifications: RG1 ($n = 1$, $p = 100$, $m = 10$, $i_{\min} = 2$, and $i_{\max} = 4$), RG2 ($n = 10$, $p = 100$, $m = 10$, $i_{\min} = 2$, and $i_{\max} = $ …), RG3 ($n = 20$, $p = 100$, $m = 5$, $i_{\min} = 2$, and $i_{\max} = $ …), RG4 ($n = 5$, $p = 200$, $m = 5$, $i_{\min} = 2$, and $i_{\max} = 3$), where $i_{\min}$ and $i_{\max}$ denote the minimal and maximal numbers of successors of any independent and intermediate vertex in the c-graph, respectively. The same logarithmic cooling schedule with $\Gamma = 2$ was applied to all four problems for one thousand steps and a start configuration START which, again, was chosen as the worse of VFM and VRM. The minima obtained by simulated annealing (VSA) for RG2 and RG3 already lie below the best values achieved by other known algorithms [20], [21] (VMIN) and are highlighted in the table below. The development of the value of the objective function is illustrated in Fig. 8.

Fig. 8. Simulated annealing of randomly generated c-graphs





        START    VSA   VMIN
RG1      1526    152     95
RG2       810   305*    344
RG3      1087   765*    865
RG4      1925   1132    529

(* marks simulated annealing minima below VMIN.)



4 Outlook 

The generation of optimized derivative code for mathematical models given as 
computer programs is considered as a compile-time activity. Thinking of differ- 
entiation enabled compilers [18] the runtime of algorithms for optimizing elim- 
ination sequences represents an important factor. In this situation simulated 
annealing is likely to come into play only if the runtime of the algorithm can be 
fixed a priori as in Sec. 3 or if the generated code will be used repeatedly over 
a long period of time. This is when the effort of experimenting with different 
neighborhood relations and cooling schedules might pay off in form of a highly 
efficient derivative code. 

In order to be able to apply elimination techniques the c-graph of the function has to be built at compile time. In general, computer programs contain branches and loops with variable bounds so that this requirement cannot be met. However, c-graphs of basic blocks [2] can often be obtained easily unless effects such as aliasing [2] occur. The exploitation of the theoretical results in this field is subject to ongoing research and development [28].

So is the theoretical investigation of OJA, in general. In [22] we could show 
that vertex elimination is not general enough to yield the solution of OJA for 
all c-graphs. Neither is edge elimination as introduced in [19]. A quantitative 
characterization of the vertex and edge discrepancies remains one of the un- 
solved problems in AD. We are currently trying to develop simulated annealing 
algorithms for both edge and face elimination as defined in [24]. So far, both 
problems turned out to be much harder to handle than pure vertex elimination. 
While vertex elimination sequences are simply permutations of the intermediate 
vertices both edge and face elimination sequences change dynamically due to 
new edges being introduced (fill-in). 

An important part of our research is dedicated to the development of ro- 
bust and efficient algorithms (both sequential and parallel) for finding nearly 
optimal elimination sequences. Colleagues from the University of Cranfield / 
Royal Military College of Sciences, Shrivenham, UK, are developing a compiler 
infrastructure which will allow us to parse real world application programs into 
internal representations suitable for elimination techniques. A quasi-standard 
interface between the compiler and our optimization module is being developed 
to allow for easy communication of c-graphs and elimination sequences [26]. 

Finally, we are working towards a comprehensive complexity analysis of the 
various elimination problems. Herley [13] could show that the problem of min- 
imizing the fill-in is NP-complete under vertex elimination. The closely related 
VE is conjectured to exhibit the same property. Further research is required to 
prove analogous results for more general elimination techniques as introduced in 
[24]. 
Abstract. We want to fill a given two-dimensional closed contour as
accurately as possible with a fixed number of identical, flexible objects.
These objects have to be packed sequentially. They adapt themselves
to the surface they are packed on, but their deformation can only be
simulated. This type of problem is the two-dimensional cross-section of
manufacturing processes where soft material is wound onto a mandrel.
We formulate this as a problem of dynamic programming with simulated
law of motion. It allows an evolutionary algorithm approach that
successfully produces approximate solutions in a non-sequential fashion.

Keywords: Evolutionary algorithms, optimization and simulation, 
packing problems, dynamic programming. 



1 Introduction 

We consider the problem of filling a given two-dimensional container with a fixed
number N of objects of identical type. The objects are smooth and adapt themselves
to the surface they are packed on. The packing has to start at the bottom
of the container and the objects have to be placed sequentially, see Fig. 1. The
aim is to fill the container, which is given as a closed two dimensional contour, 
as accurately as possible. The particular additional constraint here is that the 
exact behaviour of the objects, their deformation function, is not available in 
a closed analytical form. Instead, it has to be simulated based on assumptions 
about properties of the material. 

Hence, we have to solve an optimization problem with a simulated target 
function. We use a dynamic programming framework to give a precise definition 
of our problem. The dynamics of the model depend on the simulated deformation 
of the objects, which rules out most solution methods of dynamic programming 
like backward induction. Instead, we suggest an approximate procedure that 
consists of a genetic algorithm framework optimizing the relative position of the 
objects using the results of the simulation as fitness. 

The main point here is to find a coding of the feasible solutions and of the 
cost function (i.e. the deviation of the final layout from the target contour) such 
that genetic algorithm techniques can be applied efficiently. With our approach 
standard genetic operators yield feasible solutions without any problem-specific
adaptations. Also, these operations behave 'locally' enough to produce offspring
that inherit properties of their parents, leading to astonishingly good solutions
in very short time for a number of real world problems.

The problem described is the two-dimensional cross-section of certain three-dimensional packing problems that arise in the manufacturing of rotationally symmetric bodies like tubes, rings or tanks from fibre-reinforced plastic. Here, a continuous fibre bundle impregnated with a polymer is wound around a rotating mandrel, see e.g. [6] for technical details. The shape of the mandrel determines the inner contour of the workpiece, the outer contour is formed by the layers of the composite fibre bundles after hardening or cooling. So the winding robot has to put more layers at places where the side of the workpiece is to be stronger and fewer where there is to be a groove. Generally the aim is to control the robot in such a way that a predefined outer contour is filled as accurately as possible with a given number of windings. As the spooled material is soft and sticky, it will adapt to the surface or is even pressed on to it to prevent air inclusion.

Due to the rotational symmetry, we can restrict the problem to (one half of) the 2-dimensional cross-sections of the workpiece and of the rope to be spooled, see Fig. 1. Here, (a) shows the cross-section of a ring. The cross-section of the rope when wound onto an even surface will look like Fig. 1 (b); this will be referred to as an object. Fig. 1 (c) shows one half of the ring (the target shape) with four objects placed, i.e. with four rounds of rope already applied. The target shape consists of the lower starting and the upper target contour. We assume that the axis of the mandrel is parallel to the x-axis as indicated in Fig. 1. We assume further that the number of objects is chosen appropriately (i.e. the area of the target shape is N times the area of the objects). Note that we neglect possible problems in the original 3-dimensional problem that may occur when 'changing the lane' during winding.

Classical problems of packing and cutting, see e.g. [2], mostly consider objects of fixed shape; an exception is [4]. Methods for solving them include dynamic programming, integer programming, problem-specific algorithms and several heuristics, see [2]. In [3], simulated annealing is applied to a simple packing problem.




Fig. 1. Target shape and objects. 
The paper is organized as follows. In Section 2, we give a formulation of the
problem in a dynamic programming set-up and define the solution. In Section 3 
we sketch how to simulate the deformation of a single object and how to evaluate 
the cost of a complete solution approximately. In Section 4 we then present a 
genetic algorithm that produces good solutions in a non-sequential way. In the 
final Section 5 we report on some practical experiences with an implementation 
of the algorithm. 

2 The Mathematical Model 

Recall that a dynamic programming model (see [1]) consists of states $s$, actions $a$, a transition function $T(s,a)$ that maps state-action pairs into new states, a cost structure with local costs $c(s,a)$ connected to each transition and a terminal cost function $V_0(s)$ for the final state.

Our packing problem fits into this framework if we regard the decision where 
to place the next object as an action. Then N := no. of objects to be placed 
equals the number of optimization stages. The state of the packing has to contain 
all information relevant for the further placement, in particular, the present 
upper contour formed by the objects already placed. This will be made precise 
in the following sub-sections. 

2.1 Feasible Contours 

We describe the contour by two functions for its upper and lower half. The set of possible contour functions is

$$\mathcal{C} := \{C : [\alpha,\beta] \to \mathbb{R}_+ \mid C \text{ bounded, continuous and piecewise differentiable}\}$$

with $\alpha < \beta$, $\alpha, \beta \in \mathbb{R}$. We assume that the given starting contour $C_0$ and the target contour $C$ belong to $\mathcal{C}$ with $C_0(x) < C(x)$ for $x \in (\alpha,\beta)$ and $y_\alpha := C_0(\alpha) = C(\alpha)$, $y_\beta := C_0(\beta) = C(\beta)$, see Fig. 2. During the placement, the starting contour $C_0$ is changed to contours $C_1 \leq C_2 \leq \dots \leq C_N \in \mathcal{C}$ by adding on the objects.

We characterize each point on a contour $C \in \mathcal{C}$ by its distance from the left endpoint of $C$ measured in arc length, 'C-distance' for short. To do so, let

$$A_C(x) := \int_\alpha^x \sqrt{1 + C'(z)^2}\, dz, \quad x \in [\alpha,\beta],\ C \in \mathcal{C}$$

denote the C-distance of the point $(x, C(x))$ from the left starting point $(\alpha, C(\alpha))$ measured in arc length along $C$.
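For a contour that is given by sampled points, the C-distance can be computed segment by segment. The sketch below assumes a piecewise-linear contour through the points `(xs[i], ys[i])` with strictly increasing `xs`; this discretization and the function name are assumptions for illustration, since the text only requires a piecewise differentiable contour.

```python
import math


def arc_length(xs, ys, x):
    """A_C(x) for the piecewise-linear contour through (xs[i], ys[i])."""
    total = 0.0
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x0:
            break                      # remaining segments lie right of x
        xe = min(x, x1)                # integrate up to x within this segment
        slope = (y1 - y0) / (x1 - x0)  # C'(z) is constant on a segment
        total += (xe - x0) * math.sqrt(1.0 + slope * slope)
    return total


# Contour rising from (0, 0) to (3, 4), then flat until (6, 4).
xs, ys = [0.0, 3.0, 6.0], [0.0, 4.0, 4.0]
print(arc_length(xs, ys, 3.0), arc_length(xs, ys, 6.0))
```

On the first segment the integrand equals $\sqrt{1 + (4/3)^2} = 5/3$, so the C-distance of the point at $x = 3$ is 5.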
Fig. 2. A target shape with starting contour $C_0$ and target contour $C$; point $(x, C(x))$ has distance $A_C(x)$ from the left endpoint.



2.2 The State Space 

Next we define the state for the dynamic programming model described above. 

We assume that the winding robot moves from one side of the target shape 
to the other and back again, i.e. it produces complete layers of material and may 
not change its direction freely. This restricts the space of possible solutions but 
simplifies matters considerably. 

For a given starting contour $C_0 \in \mathcal{C}$ with $y_\alpha = C_0(\alpha)$, $y_\beta = C_0(\beta)$ we define the state space

$$S := \{(C,t,d) \mid C \in \mathcal{C},\ C(\alpha) = y_\alpha,\ C(\beta) = y_\beta,\ C(x) \geq C_0(x),\ \alpha < x < \beta, \text{ and } 0 \leq t \leq A_C(\beta),\ d \in \{-1,+1\}\}.$$

Here, the state $s = (C,t,d)$ indicates that the objects placed so far form an upper contour $C$ and that the location of the last object placed is represented on $C$ by $t$ (in C-distance). $d = +1$ indicates that the present layer runs from left to right, $d = -1$ indicates the opposite direction. Note that the last object was not placed on $C$; point $t$ only represents that location, see details below. In the starting state $(C_0,t,d)$, $t$ indicates an arbitrary reference point for the first placement.

2.3 Action and Placement 

As described above, the placement of an object has to be formulated as action 
in the dynamic programming model. 

We assume that the objects all have identical shape before placement (i.e. 
the rope of material has a constant cross-section) and that the length L of their 
lower contour (base line) and the volume of the cross-section when placed on 
a surface are constant, see Figs. 1 and 3. As reference point for the placement
(basepoint) we take the middle of this base line; it is marked by a black
triangle in the figures below.

Fig. 3. An object is placed on C with its basepoint at C-distance $t + a\lambda$.

Let $\lambda > 0$ be the maximal C-distance between two successive objects that
can be managed by the winding machine. Then, action $a \in [0,1]$ applied to state
$(C,t,d)$ means to place the next object with its basepoint at a C-distance of
$da\lambda$ from t, i.e. at C-distance $t + da\lambda$ from the left endpoint, see Fig. 3. Near
the left and right border of the target shape this may lead to infeasible positions
$t + da\lambda \notin [0, \Lambda_C(\beta)]$. In this case the direction d is changed to $-d$ and the object is
placed at the next feasible position; hence the exact position of the new object
is at a C-distance $\tau(s,a)$ where



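$$\tau(s,a) = \tau((C,t,d),a) := \begin{cases} t + da\lambda & \text{if } L/2 \le t + da\lambda \le \Lambda_C(\beta) - L/2 \\ L/2 & \text{if } t + da\lambda < L/2 \\ \Lambda_C(\beta) - L/2 & \text{if } t + da\lambda > \Lambda_C(\beta) - L/2 \end{cases} \tag{1}$$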



and the direction after placement is given by 



$$\delta(s,a) = \delta((C,t,d),a) := \begin{cases} d & \text{if } L/2 \le t + da\lambda \le \Lambda_C(\beta) - L/2 \\ -d & \text{else.} \end{cases}$$



Note that this notion of feasible placement does not prevent objects from being
placed completely outside of the target shape; see Fig. 6 below for an example.
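The placement rule (1) and the accompanying direction rule can be sketched as a single function. This is only an illustration: the name `place` and the argument `total_length` (standing for the C-distance of the right endpoint) are assumptions made here, not identifiers from the paper.

```python
def place(t, d, a, lam, L, total_length):
    """Return (position, direction) of the next object.

    t: reference point (C-distance), d: direction (+1 or -1),
    a: offset in [0, 1], lam: maximal step, L: base-line length,
    total_length: C-distance of the right endpoint of the contour.
    """
    target = t + d * a * lam
    lowest, highest = L / 2, total_length - L / 2
    if target < lowest:
        return lowest, -d     # clamp at the left border and turn around
    if target > highest:
        return highest, -d    # clamp at the right border and turn around
    return target, d          # feasible position, direction unchanged
```

With `L = 2` on a contour of length 10, a step from `t = 8.5` by `lam = 2` to the right would land at 10.5, so the object is placed at 9 and the next layer runs to the left.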



2.4 The State Transition 

The state transition function has to describe the contour after a placement of a
new object. Let the local deformation function for an object

$$l \mapsto \gamma(C,\tau,l)$$

be a bounded, continuous and piecewise differentiable function that gives the
height of the object l units (in C-distance) from its left endpoint if the object
is placed on C at a C-distance $\tau$, see Fig. 4. In the next Section, we shall give
an idea how to simulate $\gamma$.

The new contour after placing an object in state $s = (C,t,d) \in S$ with a
relative offset of $a \in [0,1]$ is now given by

$$C_{s,a}(x) := \begin{cases} C(x) & \text{for } \Lambda_C(x) < \tau(s,a) - L/2 \\ C(x) + \gamma(C,\tau(s,a),l) & \text{for } \Lambda_C(x) = \tau(s,a) - L/2 + l,\ 0 \le l \le L \\ C(x) & \text{for } \Lambda_C(x) > \tau(s,a) + L/2 \end{cases} \tag{2}$$

for $\alpha \le x \le \beta$. Note that the region outside $\tau(s,a) \pm L/2$ refers to that part of
the contour that is left untouched by the newly placed object. Obviously, $C_{s,a}$
belongs to $\mathcal{E}$ for all s, a.

Fig. 4. The object is placed at C-distance $\tau$. Its height l units from its left endpoint
is given by $\gamma(C,\tau,l)$.

Fig. 5. The reference point for the next offset is marked by a black dot. It is a projection
of the basepoint (black triangle) of the last placement onto the new surface.

We finally have to determine how the location $\tau(s,a)$ at which the last object
was placed on C is represented on the new contour $C_{s,a}$ as reference point for the
next placement. We use the projection of the basepoint of the last object onto the
new contour where the projection is perpendicular to the line connecting left and
right endpoint of the object, see Fig. 5. From simple geometric considerations we
obtain that the x-coordinate $x_z = x_z(s,a)$ of this point is the smallest solution
y of

$$(y - z)\,\frac{\overline{x} - \underline{x}}{C(\underline{x}) - C(\overline{x})} + C(z) = C_{s,a}(y), \tag{3}$$

where $\underline{x}$, $\overline{x}$ (the endpoints of the object) and z are as in Fig. 5. Now the complete
transition function for the dynamic programming model with local deformation
functions $\gamma$ can be defined from (2) and (3) as

$$T(s,a) = T((C,t,d),a) := \bigl(C_{s,a},\ \Lambda_{C_{s,a}}(x_z(s,a)),\ \delta(s,a)\bigr). \tag{4}$$



2.5 The Cost Structure 



We have no local costs in our model but a terminal cost function

$$V_0(s) = V_0((C,t,d)) := \int_\alpha^\beta |C(x) - \bar{C}(x)|\, dx, \tag{5}$$

which gives the deviation from the target contour in the final state.

Now the dynamic programming problem is completely determined. A solution
to this problem is a sequence of offsets

$$(a_0, \ldots, a_{N-1}) \in [0,1]^N$$

for the consecutive placement of the N objects on the starting contour $C_0$. Note
that any sequence from $[0,1]^N$ constitutes a feasible solution. Its costs are given
by

$$V_N(s_0, a_0, \ldots, a_{N-1}) = V_0\bigl(T(\ldots T(T(s_0,a_0),a_1), \ldots, a_{N-1})\bigr) = \int_\alpha^\beta |C_N(x) - \bar{C}(x)|\, dx, \tag{6}$$

where $C_N$ is the final upper contour after applying $a_0, \ldots, a_{N-1}$ to $s_0$. An optimal
solution for starting state $s_0 = (C_0,t,d)$ is a sequence $(a_0^*, \ldots, a_{N-1}^*) \in [0,1]^N$
that minimizes (6) over $[0,1]^N$.
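Evaluating (6) amounts to folding the transition function over the offset sequence and scoring the final state. A minimal sketch follows; `transition` and `terminal_cost` are hypothetical callables standing in for T and the terminal cost, whose actual implementation (the physical simulation) is not reproduced here.

```python
from functools import reduce

def evaluate(s0, offsets, transition, terminal_cost):
    """Apply the transition once per offset, then score the final state."""
    final_state = reduce(transition, offsets, s0)  # T(...T(T(s0, a0), a1)..., a_{N-1})
    return terminal_cost(final_state)
```

As a toy usage, with a state that is just a number, `transition = lambda s, a: s + a` and `terminal_cost = lambda s: abs(s - 1.5)`, the offsets `[0.5, 0.5, 0.25]` applied to `0.0` give cost `0.25`.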

3 Simulating the Local Deformation 

We only sketch the simulation very briefly as it is not in the focus of the present 
paper. 

In our experiments it turned out to be sufficiently accurate to use piecewise
linear functions for the contours as well as for the objects, and hence for the
deformation function $\gamma$. This greatly simplifies the calculation described above
and also the representation of a contour on the computer.

Let $\gamma_0 : [0, L] \to \mathbb{R}_+$ denote the piecewise linear upper contour of the object
when placed on an even surface. We add $\gamma_0$ to the present contour at the location
given by $\tau(s,a)$. Let $\Gamma$ denote that part of the new contour $C_{s,a}$ that is formed
by $\gamma_0$.

We use two different levels of simulation to improve $\Gamma$. The first level assumes
that the surface of an object will always tend to form a section of a circle. From
simple geometric considerations we can determine a circle that runs through the
two endpoints of the object placed at $\tau(s,a)$ ($\underline{x}$, $\overline{x}$ in Fig. 5) and encloses the
exact volume of the object. We then approximate the relevant section of the
circle by a polygon which is slightly changed afterwards to enclose the correct
volume. This yields an improved deformation function $\gamma_1$.

This is a very fast procedure that works well as long as the surface C is not
too ragged. Otherwise the circle may intersect with C. This has to be detected
and then the second, more detailed level of simulation has to be called. Here,
the polygon describing $\Gamma$ is optimized to yield smallest length and minimal
distortion with the correct enclosed volume. This is done using a standard library
for non-linear optimization. The result is a simulated contour $C_{s,a}$ which takes
into account more physical properties of the material.

For a given starting state $s_0$ and a sequence of offsets $a_0, \ldots, a_{N-1} \in [0,1]$
we evaluate the cost function $V_N(s_0, a_0, \ldots, a_{N-1})$ as given in (6) successively
by N calls of the simulation.

4 An Evolutionary Algorithm

We shall now describe an evolutionary heuristic algorithm that yielded excellent
solutions in our experiments. Let us briefly collect the main ingredients of
genetic algorithms (see e.g. [5]). A starting population of solutions (’individuals’)
is created at random. It is subject to random genetic operations that produce
offspring solutions. Standard operators are crossover of two randomly selected
solutions and random mutation of a solution. Repeated application of these
operators enlarges the population which is then reduced by a selection mechanism
to its former size. This cycle (’generation’) is repeated for a certain number
of times. The selection prefers solutions with low costs. Thus the members of
the successive populations tend to become good solutions of the optimization
problem. There are many different ways to select solutions for the operations
and to take into account their ’fitness’, i.e. the cost of the solution.

In the preceding Sections we have formulated our problem such that the
structure of a solution $a_0, \ldots, a_{N-1}$ is extremely simple: it lies in the unit
cube $[0,1]^N$ and each vector in this space represents a feasible placement
of N objects. Therefore we may use simple standard genetic operators without
any problem specific mending operation. The fitness of a solution $a_0, \ldots, a_{N-1}$
is measured by the cost function $V_N(s_0, a_0, \ldots, a_{N-1})$ as defined in (6). Note
that we minimize fitness.

For crossover we use a one-point or uniform operator. The one-point crossover
takes two randomly selected solutions $(a_0, \ldots, a_{N-1})$ and $(a'_0, \ldots, a'_{N-1})$ as
parents, selects a random position $l \in \{0, \ldots, N-1\}$ and returns the solution
$(a_0, \ldots, a_{l-1}, a'_l, \ldots, a'_{N-1})$. In terms of controlling the winding machine this
means to use offsets from the first solution for the first l rounds and then to
follow the second solution. Uniform crossover selects each position in the resulting
solution independently from one of the parents.
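Both crossover operators act on plain lists of offsets. The sketch below is illustrative only: the function names, the `rng` parameter, and the choice of probability 1/2 per position in the uniform operator are assumptions, since the text does not specify them.

```python
import random

def one_point_crossover(parent1, parent2, rng=random):
    """First l offsets from parent1, the remaining ones from parent2."""
    l = rng.randrange(len(parent1))       # random position in {0, ..., N-1}
    return parent1[:l] + parent2[l:]

def uniform_crossover(parent1, parent2, rng=random):
    """Each position is copied independently from one of the two parents."""
    # Assumption: both parents are equally likely at every position.
    return [x if rng.random() < 0.5 else y for x, y in zip(parent1, parent2)]
```

Because every vector in the unit cube is feasible, neither operator needs a repair step afterwards.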

Note that we restricted changes of the direction d to the lateral borders where
it is calculated by the cost function. Including changes of direction into the
solution would cause problems with crossover operators, as the impact of a change
in direction depends heavily on its absolute location within a placement, which
in general will be changed during a crossover. Then the offspring individual
will be quite different from its parents, which would make the algorithm search
rather arbitrarily in the solution space.

We apply mutation to each new offspring solution created by crossover.
Mutation operators randomly change single offsets of a solution. Again, even a small
change of a single offset may lead to a completely different solution, as all
subsequent objects are shifted. We therefore use a ’local’ mutation that also changes
the next offset to neutralize the wide-range effect of the mutated offset. The
procedure is as follows: select an index $0 \le i \le N-2$ randomly with a bias towards
higher values. In the solution to be mutated, let $\sigma = a_i + a_{i+1}$. Select a new
value $a'_i$ uniformly distributed over

$$[\max\{1,\sigma\} - 1,\ \min\{1,\sigma\}]$$

and put $a'_{i+1} = \sigma - a'_i$. Then $a'_i, a'_{i+1} \in [0,1]$ and $a'_i + a'_{i+1} = \sigma$, i.e. the position
of objects $i+2, i+3, \ldots$ on their contour is unchanged. Note however, that
the contours themselves will have changed due to the moving of objects i and
i + 1. This may be illustrated by looking at the $a_0, \ldots, a_{N-1}$ as distances of
beads on a string. Mutation then moves exactly one of the beads and leaves the
others unchanged.
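The local mutation can be sketched as follows. The index i is passed in explicitly because the text does not say how the bias towards higher indices is realized; the function name, the `rng` parameter, and the clamping against floating-point rounding are assumptions of this sketch.

```python
import random

def local_mutation(offsets, i, rng=random):
    """Redistribute a_i + a_{i+1} between positions i and i+1, keeping the sum."""
    sigma = offsets[i] + offsets[i + 1]
    low, high = max(1.0, sigma) - 1.0, min(1.0, sigma)
    new_i = min(high, max(low, rng.uniform(low, high)))
    mutated = list(offsets)
    mutated[i] = new_i
    # Clamp only to absorb rounding error; mathematically the value is in [0, 1].
    mutated[i + 1] = min(1.0, max(0.0, sigma - new_i))
    return mutated
```

All offsets other than positions i and i + 1 are copied unchanged, which matches the bead picture: only the bead between the two affected gaps moves.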

Let M be the population size. Crossover and mutation are repeated for a
fixed number M' of times, enlarging the population to M + M' elements. The
selection mechanism then takes M solutions from the enlarged population to
form the next population. These may be the M solutions with lowest (simulated)
costs $V_N$, or they may be drawn with a probability density proportional to the
$V_N$-values or proportional to their rank with respect to these cost values. Again
these are standard selection procedures.
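The simplest of the selection variants mentioned, keeping the M cheapest solutions, can be sketched in one line. The name `truncation_selection` and the `cost` callable are assumptions of this sketch; the probabilistic variants are not shown.

```python
def truncation_selection(population, cost, M):
    """Keep the M solutions with the lowest cost from the enlarged population."""
    return sorted(population, key=cost)[:M]
```

In practice `cost` would be the simulated value of (6), so each call to it is as expensive as N simulation steps; caching the cost per individual is a natural refinement.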

We provided the system with a graphical user interface that displays
the evolution of the population with their cost values. The evolution may be
stopped and the individuals may be inspected graphically as shown in Figs. 6
and 7 below.



5 Experimental Results 

We applied our model to several contours supplied by our industrial partner.
Typically there were about 60 objects to be placed. Target contour and the
cross-section of the object were given as polygons, similar to Fig. 1.

A starting population was created randomly; good results were obtained with
individuals that had constant offsets $a_0, \ldots, a_{N-1}$ with different random values
for different individuals. As starting state $(C_0, t, -1)$ we used the lower contour
$C_0$ with its middle point as reference point for the first object.




Fig. 6. Two individuals from a starting population. 



Fig. 6 shows two individuals from a starting population. The one on the
left was produced with constant offsets, the other with random offsets. Note
that these are obtained by placing a random vector of offsets into the contour
using the simulation of the cost function as described in Section 3. The much
improved results in Fig. 7 show the two best individuals obtained after about
300 generations of the genetic algorithm which took about 30 sec of cpu time on
a 650 MHz Pentium.

Fig. 7. Two individuals from a later population.



Results from this system were fed into a real winding machine. The contours 
obtained in reality were in very good accordance with our simulated contours. 
This shows that our simple model of deformation reflects reality with sufficient 
accuracy. 

6 Conclusions 

We modelled a complex technical process of placing flexible objects into a given
contour as a deterministic dynamic programming problem. The classical
sequential solution methods are not applicable as the main ingredient of the model, the
state transition function, is not known. It can only be approximated by a separate
local optimization that simulates the physical properties of the flexible material.

We therefore took a non-sequential approach that optimizes all placement
decisions at once with a genetic algorithm. The formulation of the model allowed
us to produce feasible solutions with standard genetic operators without the need
of problem specific mending operations.

In real world applications with about 60 objects this algorithm produced
astonishingly good results in just a few seconds.

Abstract. Recently, we have developed a learning model, called
stochastic finite learning, that makes a connection between concepts from
PAC learning and inductive inference learning models. The motivation
for this work is as follows. Within Gold’s (1967) model of learning in the
limit many important learning problems can be formalized and it can be
shown that they are algorithmically solvable in principle. However, since
a limit learner is only supposed to converge, one never knows at any
particular learning stage whether or not it has already been successful.
Such an uncertainty may not be acceptable in many applications. The
present paper surveys the new approach to overcome this uncertainty
that potentially has a wide range of applicability.

Keywords: Inductive inference, average-case analysis, stochastic finite 
learning, conjunctive concepts, pattern languages. 



1 Introduction 

Let us assume that we have to deal with the following problem. We are given
a concept class C and should design a learner for it. Next, suppose we already
know or could prove C not to be PAC learnable. But it can be shown that C
is learnable within Gold’s [6] model of learning in the limit. Consequently, there
is a learner behaving as follows. When fed any of the data sequences allowed
in this model, it converges in the limit to a hypothesis correctly describing the
target concept. Nothing more is known. Since it is either undecidable whether
or not the sequence of hypotheses has already converged or, even if it is decidable,
it is practically infeasible to do so, one never knows whether or not the learner
has already converged. But such an uncertainty may not be tolerable in many
applications. So, how can we recover?

In general, there may be no way to overcome this uncertainty at all, or at least
not efficiently. But if the learner satisfies some additional assumptions outlined
below then there is a rather general way to transform learning in the limit into
stochastic finite learning. Within this survey, we aim to describe the method
we developed and to exemplify it by using some well-known concept classes.

It should also be noted that our approach may be beneficial even in case that
the considered concept class is PAC learnable.

K. Steinhöfel (Ed.): SAGA 2001, LNCS 2264, pp. 155-171, 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

2 Learning in the Limit

Gold’s [6] model of learning in the limit allows one to formalize a rather general
class of learning problems, i.e., learning from examples. For defining this model
we assume any recursively enumerable set $X$ and refer to it as the learning
domain. By $\wp(X)$ we denote the power set of $X$. Let $C \subseteq \wp(X)$, and let
$c \in C$ be non-empty; then we refer to C and c as a concept class and a concept,
respectively. Let c be a concept, and let $t = (x_j)_{j \in \mathbb{N}}$ be any infinite sequence of
elements $x_j \in c$ such that $\mathrm{range}(t) =_{df} \{x_j \mid j \in \mathbb{N}\} = c$. Then t is said to be
a positive presentation or, synonymously, a text for c. By text(c) we denote the
set of all positive presentations for c. Moreover, let t be a positive presentation,
and let $y \in \mathbb{N}$. Then, we set $t_y = x_0, \ldots, x_y$, i.e., $t_y$ is the initial segment of t
of length $y + 1$, and $t_y^+ =_{df} \{x_j \mid j \le y\}$. We refer to $t_y^+$ as the content of $t_y$.

Alternatively, one can also consider complete presentations or, synonymously,
informants. Let c be a concept; then any sequence $i = (x_j, b_j)_{j \in \mathbb{N}}$ of
labeled examples, where $b_j \in \{+,-\}$ such that $\{x_j \mid j \in \mathbb{N}\} = X$ and
$i^+ = \{x_j \mid (x_j,b_j) = (x_j,+),\ j \in \mathbb{N}\} = c$ and
$i^- = \{x_j \mid (x_j,b_j) = (x_j,-),\ j \in \mathbb{N}\} = X \setminus c$ is called an informant for c. For the sake of presentation, the
following definitions are only given for the text case; the generalization to the
informant case should be obvious. We sometimes use the term data sequence to
refer to both text and informant, respectively.

An inductive inference machine (abbr. IIM) is an algorithm that takes as
input larger and larger initial segments of a text and outputs, after each input, a
hypothesis from a prespecified hypothesis space $H = (h_j)_{j \in \mathbb{N}}$. The indices j are
regarded as suitable finite encodings of the concepts described by the hypotheses.
A hypothesis h is said to describe a concept c iff c = h.

Definition 1. Let C be any concept class, and let $H = (h_j)_{j \in \mathbb{N}}$ be a hypothesis
space for it. C is called learnable in the limit from text iff there is an IIM M
such that for every $c \in C$ and every text t for c,

(1) for all $n \in \mathbb{N}^+$, $M(t_n)$ is defined,

(2) there is a j such that $c = h_j$ and for all but finitely many $n \in \mathbb{N}^+$,
$M(t_n) = j$.

By LIM we denote the collection of all concept classes C that are learnable
in the limit from text. Note that instead of LIM sometimes EX is used.

It can be shown that many interesting concept classes are learnable in the
limit from text and informant, respectively. However, it should be noted that
there are also some major points of concern. First, in the definition given above,
the limit learner always has access to the whole actual initial segment of the
data sequence provided. Clearly, in practice such a learner would run out of
space pretty soon. This problem can be overcome by considering incremental
learning algorithms. We do not develop this topic here any further. Instead, the
reader is referred to [3,13].

Second, since a limit learner is only supposed to converge, one never knows
at any particular learning stage whether or not it has already been successful.
Such an uncertainty may not be acceptable in many applications. Clearly, one
could modify the definition of learning in the limit by adding the requirement
that convergence should be decidable. Then one would arrive at finite learning
(cf. Gold [6]). However, the power of finite learning is much smaller than that of
learning in the limit. Finally, it seems difficult to define an appropriate measure
for the complexity of learning in the limit (cf. Pitt [15]).

Note that the latter points of concern are really serious, since a major task of
algorithmic learning theory consists in providing performance guarantees. Thus,
it is only natural to ask why we do not simply refrain from considering learning
in the limit. Alternatively, we could try to use the well-known PAC model (cf. [20]). Our
answer is twofold. During the last decade many interesting concept classes have
been proven to be not PAC learnable. Moreover, even if a concept class is PAC
learnable, the performance guarantees provided by this model are often much
too pessimistic. So, how can we recover?

Though in general there may be no way to recover, we recently discovered
a rather general method that allows one to overcome the difficulties described
above. In the remaining part of this paper, we outline this approach. The key
ingredient is an average-case analysis of the learning algorithms considered as
initiated by Zeugmann [22]. As it turned out, the most promising complexity
measure to be considered is the so-called total learning time, i.e., the overall
time taken by a learner until convergence. Formally, the total learning time is
defined as follows (cf. Daley and Smith [4]). Let C be any concept class, and let
M be any IIM that learns C in the limit. Then, for every $c \in C$ and every text
t for c, let

$$Conv(M,t) =_{df} \text{the least number } m \in \mathbb{N}^+ \text{ such that for all } n \ge m,\ M(t_n) = M(t_m)$$

denote the stage of convergence of M on t (cf. [6]). Note that $Conv(M,t) = \infty$
if M does not learn the target concept from its text t. Moreover, by $T_M(t_n)$
we denote the time to compute $M(t_n)$. We measure this time as a function of
the length of the input and call it the update time. Finally, the total learning
time taken by the IIM M on successive input t is defined as

$$TT(M,t) =_{df} \sum_{n=1}^{Conv(M,t)} T_M(t_n).$$

Clearly, if M does not learn the target language from text t then the total
learning time is infinite.
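The stage of convergence is defined over the infinite text and cannot be read off a finite prefix. What can be computed from a finite run is the last observed mind change, which is only a lower bound on Conv(M, t). The sketch below illustrates the two definitions on such a prefix; the function names and the list-based inputs are assumptions made for illustration.

```python
def observed_convergence(hypotheses):
    """Least m (1-based) with hypotheses[n-1] == hypotheses[m-1] for all observed n >= m.

    This is a lower bound on the true stage of convergence, since the learner
    might still change its mind on data not yet seen.
    """
    m = len(hypotheses)
    while m > 1 and hypotheses[m - 2] == hypotheses[m - 1]:
        m -= 1
    return m

def observed_total_time(hypotheses, update_times):
    """Sum of update times T_M(t_n) for n = 1, ..., observed stage of convergence."""
    return sum(update_times[:observed_convergence(hypotheses)])
```

For the hypothesis sequence `a, b, b, c, c, c` with update times `1, 2, 3, 4, 5, 6`, the last mind change occurs at stage 4, so the observed total learning time is 1 + 2 + 3 + 4 = 10.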

It has been argued elsewhere that within the learning in the limit paradigm a
learning algorithm is invoked only when the current hypothesis has some problem
with the latest observed data. Clearly, if this viewpoint is adopted, then our
definitions of learning and of the total learning time seem inappropriate. Note
however, that there may be no way at all to decide whether or not the current
hypothesis is correct for the latest piece of data received. But even if one
can decide whether or not the latest piece of data obtained is correctly reflected
by the current hypothesis, such a test may be computationally infeasible. For
example, consider the well-known class of all pattern languages (cf. Angluin [1]).
The membership problem for the pattern languages is $\mathcal{NP}$-complete. Thus,
directly testing consistency would immediately lead to a non-polynomial update
time unless $\mathcal{P} = \mathcal{NP}$. Moreover, at least for the informant case there is no
polynomial time update learning algorithm at all that infers the whole class of
pattern languages (cf. [21]). Thus, adopting the definition of the total learning
time given above is reasonable.

Next, one has to assume an appropriate class of probability distributions $\mathcal{D}$
over the relevant learning domain. Ideally, this class should be parameterized. In
what follows, it is assumed that the data sequences are drawn at random with
respect to any of the distributions from $\mathcal{D}$. Next, it is advantageous to introduce
a random variable CONV for the stage of convergence. Note that CONV can
also be interpreted as the total number of examples read by the IIM M until
convergence. The first major step of the average-case analysis consists now in
determining the expectation E[CONV]. Clearly, E[CONV] should be finite
for all concepts $c \in C$ and all distributions $D \in \mathcal{D}$.

Next, Markov’s inequality provides us with the following tail bounds:

$$\Pr(CONV > t \cdot E[CONV]) \le \frac{1}{t}.$$
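For completeness, the tail bound follows in one line from the standard form of Markov's inequality for a non-negative random variable. The restriction to t > 0 and the remark that E[CONV] is positive (since at least one example is read) are added here as assumptions to make the step explicit.

```latex
% Markov's inequality: Pr(Z >= a) <= E[Z] / a for Z >= 0 and a > 0.
% Apply it with Z = CONV and a = t * E[CONV], where t > 0 and E[CONV] > 0.
\Pr\bigl(CONV > t \cdot E[CONV]\bigr)
  \;\le\; \Pr\bigl(CONV \ge t \cdot E[CONV]\bigr)
  \;\le\; \frac{E[CONV]}{t \cdot E[CONV]}
  \;=\; \frac{1}{t}.
```

In particular, t = 2 bounds the probability of not having converged after 2·E[CONV] examples by 1/2, and t = 1/δ bounds it by δ.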

Having reached this stage, learning in the limit can be transformed into 
stochastic finite learning. 

3 Stochastic Finite Learning 

A stochastic finite learner is successively fed data about the target concept. Note
that these data are generated randomly with respect to one of the probability
distributions from the class of underlying probability distributions. Additionally,
the learner takes a confidence parameter $\delta$ as input. But in contrast to learning
in the limit, the learner itself decides how many examples it wants to read. Then
it computes a hypothesis, outputs it and stops. The hypothesis output is correct
for the target with probability at least $1 - \delta$.

Some more remarks are mandatory here. The description given above
explains how it works, but not why it does. Intuitively, the stochastic finite learner
simulates the limit learner until an upper bound for twice the expected total
number of examples needed until convergence has been met. Assuming this to
be true, by Markov’s inequality the limit learner has now converged with
probability at least 1/2. All that is left is to decrease the probability of failure. This can be
done by using Markov’s inequality again, i.e., increasing the sample complexity
by a factor of $1/\delta$ results in a confidence of $1 - \delta$ for having reached the stage
of convergence.
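The construction described above can be sketched as a wrapper around a limit learner. This is an illustration under stated assumptions, not the authors' implementation: the limit learner is assumed to be iterative (a function `step` mapping the old hypothesis and one example to a new hypothesis), `draw_example` is a hypothetical sampler for the unknown distribution, and `conv_bound` is an upper bound on E[CONV] that must be supplied from prior knowledge.

```python
import math

def stochastic_finite_learner(step, h0, draw_example, conv_bound, delta):
    """Simulate the limit learner on ceil(conv_bound / delta) examples, then stop.

    By Markov's inequality, Pr(CONV > conv_bound / delta) <= delta whenever
    conv_bound >= E[CONV], so the returned hypothesis is the limit learner's
    final one with probability at least 1 - delta.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n = math.ceil(conv_bound / delta)   # number of examples to read
    h = h0
    for _ in range(n):
        h = step(h, draw_example())
    return h, n
```

The number of examples grows like 1/δ here; the sharper bound available for conservative, rearrangement-independent learners replaces this factor by a logarithmic one.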

Finally, it remains to explain how the stochastic finite learner can calculate
the upper bound for E[CONV]. This is precisely the point where we need the
parameterization of the class $\mathcal{D}$ of underlying probability distributions. Since in
general, it is not known which distribution from $\mathcal{D}$ has been chosen, one has to
assume a bit of prior knowledge or domain knowledge provided by suitable upper
and/or lower bounds for the parameters involved. A more serious difficulty is to
incorporate the unknown target concept into this estimate. This step depends
on the concrete learning problem at hand, and requires some extra effort. We
shall point to some examples at the end of this section.

Next, we formally define stochastic finite learning. 

Definition 2. Let $\mathcal{D}$ be a set of probability distributions on the learning
domain, C a concept class, H a hypothesis space for C, and $\delta \in (0,1)$. $(C, \mathcal{D})$
is said to be stochastically finitely learnable with $\delta$-confidence with respect to
H iff there is an IIM M that for every $c \in C$ and every $D \in \mathcal{D}$ performs as
follows. Given any random data sequence $\theta$ for c generated according to D,
M stops after having seen a finite number of examples and outputs a single
hypothesis $h \in H$. With probability at least $1 - \delta$ (with respect to distribution D)
h has to be correct, that is c = h.

If stochastic finite learning can be achieved with $\delta$-confidence for every $\delta > 0$
then we say that $(C, \mathcal{D})$ can be learned stochastically finite with high confidence.

Note that our model of stochastic finite learning differs in several ways from 
the PAC model. First, it is not completely distribution independent. Thus, from 
that perspective, this variant is weaker than the PAC-model. But the hypothesis 
computed is exactly correct with high probability. That is, we do not measure the 
quality of the hypothesis with respect to the underlying probability distribution. 
For seeing the difference, consider a target concept class C over some fixed 
learning domain X containing the concept X . Furthermore, suppose X to 
be learnable from positive data only. Thus, it is reasonable to assume that all 
negative examples have probability 0 . Consequently, a PAC learner could simply 
output a finite description for X independent of the particular c G C to be 
learned. It would even learn from 0 examples without any error (where error is 
measured as usual in the PAC model). But this is hardly what one really
wants to learn. In contrast, a stochastic finite learner would always output a
hypothesis h such that c = h with any desired probability. Last but not least, in
the PAC model the sample complexity depends exclusively on the VC dimension
of the target concept class and the error and confidence parameters $\varepsilon$ and $\delta$,
respectively (cf. [2]). In contrast, a stochastic finite learner decides itself how
many examples it wishes to read.

Finally, if the limit learner M is additionally known to be conservative
and rearrangement independent, much more efficient stochastic finite learners
are available. A learner is said to be rearrangement independent if its output
depends exclusively on the range and length of its input (cf. [12] and the references
therein). Furthermore, a learner is conservative if it exclusively performs mind
changes that can be justified by an inconsistency of the abandoned hypothesis
with the data received so far (cf. [1] for a formal definition). In this case, we
could even prove exponentially shrinking tail bounds for E[CONV].

Theorem 1. (Rossmanith and Zeugmann [19].) Let CONV be the sample
complexity of a conservative and rearrangement-independent learning algorithm.
Then

$$\Pr(CONV > 2t \cdot E[CONV]) \le 2^{-t} \quad \text{for all } t \in \mathbb{N}.$$

Having this theorem, the sample complexity of the resulting stochastic
finite learner is only by a factor of $\log(1/\delta)$ larger than that of the original
limit learner. Consequently, Theorem 1 puts the importance of conservative and
rearrangement-independent learners into the right perspective. As long as the
learnability of indexed families is concerned, these results have a wide range of
potential applications, since every conservative learner can be transformed into
a learner that is both conservative and rearrangement-independent provided the
hypothesis space is appropriately chosen (cf. Lange and Zeugmann [12]).
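The logarithmic factor can be read off Theorem 1 directly. The following short derivation makes the choice of t explicit; the use of the base-2 logarithm and the ceiling is an assumption made here for concreteness.

```latex
% Choose t = \lceil \log_2(1/\delta) \rceil in Theorem 1.
\Pr\Bigl(CONV > 2\,\lceil \log_2(1/\delta) \rceil \cdot E[CONV]\Bigr)
  \;\le\; 2^{-\lceil \log_2(1/\delta) \rceil}
  \;\le\; 2^{-\log_2(1/\delta)}
  \;=\; \delta .
```

So reading $2\lceil \log_2(1/\delta)\rceil \cdot E[CONV]$ examples suffices for confidence $1 - \delta$. For δ = 0.01 this is 14 · E[CONV] examples, compared with 100 · E[CONV] from Markov's inequality alone.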

Since the distribution of CONV decreases geometrically, all higher moments
of CONV exist, too. Thus, instead of applying Theorem 1 directly, one can
hope for further improvements by applying even sharper tail bounds using for
example Chebyshev’s inequality.

Our model of stochastic finite learning differs to a certain extent from the 
PAC model. First, it is not completely distribution independent, since a bit 
of additional knowledge concerning the underlying probability distributions is 
required. Thus, from that perspective, stochastic finite learning is weaker than 
the PAC model. But with respect to the quality of its hypotheses, it is stronger 
than the PAC model by requiring the output to be probably exactly correct rather 
than probably approximately correct. That is, we do not measure the quality of 
the hypothesis with respect to the underlying probability distribution. Note that 
exact identification with high confidence has been considered within the PAC 
paradigm, too (cf., e.g., Goldman et al. [7]). 

Furthermore, in the uniform PAC model as introduced in Valiant [20] the
sample complexity depends exclusively on the VC dimension of the target
concept class and the error and confidence parameters $\varepsilon$ and $\delta$, respectively. This
model has been generalized by allowing the sample size to depend on the concept
complexity, too (cf., e.g., Blumer et al. [2] and Haussler et al. [9]). Provided no
upper bound for the concept complexity of the target concept is given, such PAC
learners decide themselves how many examples they wish to read (cf. [9]). This
feature is also adopted in our setting of stochastic finite learning. However, all
variants of PAC learning we are aware of require that all hypotheses from the
relevant hypothesis space are uniformly polynomially evaluable. Though this
requirement may be necessary in some cases to achieve (efficient) stochastic finite
learning, it is not necessary in general as we shall see below.



4 Learning Conjunctive Concepts 

In this section we exemplify the general approach outlined above by using the
class of all concepts describable by a monomial. For all details omitted the reader
is referred to Reischuk and Zeugmann [17].

To define the classes of concepts we deal with in this paper let
$\mathcal{L}_n = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ be a set of literals. $x_i$ is a positive literal and $\bar{x}_i$
a negative one. A conjunction of literals defines a monomial. For a monomial
m let #(m) denote its length, that is the number of literals in it.

m describes a subset L(m) of $X_n = \{0,1\}^n$, in other words a concept, in the obvious
way: the concept contains exactly those binary vectors for which the monomial
evaluates to 1, that is $L(m) := \{b \in X_n \mid m(b) = 1\}$. The collection of objects
we are going to learn is the set $C_n$ of all concepts that are describable by
monomials over $\mathcal{L}_n$. There are two trivial concepts, the empty subset and $X_n$
itself. $X_n$, which will also be called “TRUE”, can be represented by the empty
monomial. The concept “FALSE” has several descriptions. To avoid ambiguity,
we always represent “FALSE” by the monomial $x_1 \bar{x}_1 \cdots x_n \bar{x}_n$. Furthermore, we
often identify the set of all monomials over $\mathcal{L}_n$ and the concept class $C_n$. Note
that $|C_n| = 3^n + 1$.

For the concept class $\mathcal{C}_n$ we choose as hypothesis space the set of all monomials
over $\mathcal{L}_n$. We shall distinguish learning from positive data only and learning
from both positive and negative data. Note that when considering learning from
positive data only, one cannot decide whether or not the learner has already
converged. When learning from positive and negative data is considered, the
stage of convergence is decidable, but one would have to read the data sequence
until all Boolean vectors have appeared. Thus, for any interesting $n$, deciding
convergence is practically infeasible.

The learner used is Haussler's [8] Wholist algorithm. Note that this learner
computes a new hypothesis by using only the most recent example received and
its old hypothesis. Such learners are called iterative (cf. [13]). For the sake of
completeness, we include the Wholist learning algorithm here. By $data(c)$ we
denote any data sequence for the concept $c$.



Algorithm $\mathcal{P}$:

On input sequence $((b_1, c(b_1)), (b_2, c(b_2)), \ldots)$ do the following:

Initialize $h_0 := x_1 \bar{x}_1 \ldots x_n \bar{x}_n$.
for $i = 1, 2, \ldots$ do
    let $h_{i-1}$ denote $\mathcal{P}$'s internal hypothesis produced before receiving $(b_i, c(b_i))$;
    when receiving $(b_i, c(b_i))$ test whether or not $h_{i-1}(b_i) = c(b_i)$.
    if $h_{i-1}(b_i) = c(b_i)$ then set $h_i := h_{i-1}$ and output $h_i$.
    else for $j := 1$ to $n$ do
            if $b_i^j = 1$ then delete $\bar{x}_j$ in $h_{i-1}$ else delete $x_j$ in $h_{i-1}$;
         let $h_i$ be the resulting monomial and output $h_i$.
end.
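The pseudocode above can be mirrored almost line by line in executable form. The following is a minimal Python sketch, not the authors' implementation; the encoding of a literal as a pair `(j, positive)` and the names `evaluate` and `wholist` are assumptions made only for this illustration.

```python
from typing import FrozenSet, Iterable, Iterator, Sequence, Tuple

# Assumed encoding: (j, True) stands for x_j, (j, False) for its negation.
Literal = Tuple[int, bool]
Example = Tuple[Sequence[int], int]  # (Boolean vector b, label c(b))


def evaluate(h: Iterable[Literal], b: Sequence[int]) -> int:
    """Return 1 if every literal of the monomial h is satisfied by b, else 0.

    The empty monomial is satisfied by every vector (concept TRUE).
    """
    return int(all((b[j - 1] == 1) == positive for j, positive in h))


def wholist(n: int, data: Iterable[Example]) -> Iterator[FrozenSet[Literal]]:
    """Yield the hypotheses h_1, h_2, ... produced on the data sequence."""
    # h_0 contains all 2n literals and therefore represents FALSE.
    h = {(j, s) for j in range(1, n + 1) for s in (True, False)}
    for b, label in data:
        if evaluate(h, b) != label:
            # Delete every literal falsified by b: the negated literal
            # if b_j = 1, the positive literal otherwise.
            h = {(j, s) for (j, s) in h if (b[j - 1] == 1) == s}
        yield frozenset(h)
```

For instance, with target monomial $x_1 \bar{x}_3$ over $n = 3$, feeding the labeled examples `((1, 0, 0), 1)`, `((0, 0, 0), 0)`, `((1, 1, 0), 1)` leaves the hypothesis unchanged on the negative example and reaches the target after the second positive one. The sketch keeps only the current hypothesis between examples, which is the iterative property noted above.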



Now, it is easy to see that Algorithm $\mathcal{P}$ learns the set of all monomials
in the limit. If the target monomial is the concept "FALSE", then Algorithm
$\mathcal{P}$ immediately converges. Thus, we call "FALSE" the minimal concept. If the
target concept contains precisely $n$ literals, then one positive example suffices.
This positive example is unique.

In the general case, we call the literals in a monomial $m$ relevant. All
other literals in $\mathcal{L}_n$ are said to be irrelevant for $m$. There are $2n - \#(m)$
irrelevant literals. We call bit $i$ relevant for $m$ if $x_i$ or $\bar{x}_i$ is relevant for $m$.
By $k = k(m) = n - \#(m)$ we denote the number of irrelevant bits.

First, we consider learning from positive data. To avoid the trivial cases, we
let $c = L(m) \in \mathcal{C}_n$ be a concept with monomial $m$ such that
$k = k(m) = n - \#(m) > 0$. There are $2^k$ positive examples for $c$. For the
sake of presentation, we assume these examples to be binomially distributed.
That is, in a random positive example all entries corresponding to irrelevant bits
are selected independently of each other. With some probability $p$ this will be
a 1, and with probability $q := 1 - p$ a 0. We shall consider only nontrivial
distributions where $0 < p < 1$. Note that otherwise the data sequence does
not contain all positive examples. We aim to compute the expected number
of examples taken by $\mathcal{P}$ until convergence. Again we use $CONV$ to denote a
random variable counting the number of examples read till convergence.

The first example received forces $\mathcal{P}$ to delete precisely $n$ of the $2n$ literals
in $h_0$. Thus, this example always plays a special role. Note that the resulting
hypothesis $h_1$ depends on $b_1$, but the number $k$ of literals that remain to
be deleted from $h_1$ until convergence is independent of $b_1$. Using tail bound
techniques, we can show the following theorem.



Theorem 2. Let $c = L(m)$ be a non-minimal concept in $\mathcal{C}_n$, and let the
positive examples for $c$ be binomially distributed with parameter $p$. Define
$\psi := \min\{\frac{1}{1-p}, \frac{1}{p}\}$. Then the expected number of positive examples needed by
algorithm $\mathcal{P}$ until convergence can be bounded by $E[CONV] \le \lceil \log_\psi k(m) \rceil + 3$.



Proof. Let $k := k(m)$. The first positive example contains $\nu$ times a 1 and
$k - \nu$ many 0 with probability $\binom{k}{\nu} p^{\nu} q^{k-\nu}$ at the positions not corresponding
to a literal in the target monomial $m$. Now, assuming any such vector, we easily
see that $h_1$ contains $\nu$ positive irrelevant literals and $k - \nu$ negative literals.
Therefore, in order to achieve convergence, the algorithm $\mathcal{P}$ now needs positive
examples that contain at least one 0 for each positive irrelevant literal and at
least one 1 for each negative irrelevant literal. Thus, the probability that at
least one irrelevant literal survives $\mu$ subsequent positive examples is bounded
by $\nu q^{\mu} + (k - \nu) p^{\mu}$. Therefore,

$$\Pr[CONV - 1 > \mu] \;\le\; \sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot \left(\nu q^{\mu} + (k - \nu) p^{\mu}\right).$$



Next, we derive a closed formula for the sum given above. 



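Claim 1.

$$\sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot \nu = kp \qquad \text{and} \qquad \sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot (k - \nu) = kq .$$

The first equality can be shown as follows.

$$\sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot \nu
 = \sum_{\nu=1}^{k} k \binom{k-1}{\nu-1} p^{\nu} q^{k-\nu}
 = k p \sum_{\nu=0}^{k-1} \binom{k-1}{\nu} p^{\nu} q^{k-1-\nu}
 = k p \cdot (p + q)^{k-1} = k p .$$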



The other equality can be proved analogously, which yields Claim 1. 
Now, proceeding as above, we obtain 



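$$\begin{aligned}
E[CONV - 1] &= \sum_{\mu=0}^{\infty} \Pr[CONV - 1 > \mu] \\
 &\le \lambda + \sum_{\mu=\lambda}^{\infty} \sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot \left(\nu q^{\mu} + (k - \nu) p^{\mu}\right) \\
 &= \lambda + \sum_{\mu=\lambda}^{\infty} q^{\mu} \underbrace{\sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot \nu}_{= kp \text{ by Claim 1}}
   + \sum_{\mu=\lambda}^{\infty} p^{\mu} \underbrace{\sum_{\nu=0}^{k} \binom{k}{\nu} p^{\nu} q^{k-\nu} \cdot (k - \nu)}_{= kq \text{ by Claim 1}} \\
 &= \lambda + kp \sum_{\mu=\lambda}^{\infty} q^{\mu} + kq \sum_{\mu=\lambda}^{\infty} p^{\mu}
  = \lambda + k \cdot \left[q^{\lambda} + p^{\lambda}\right] \\
 &\le \lambda + 2k\, \psi^{-\lambda} .
\end{aligned}$$

Finally, choosing $\lambda = \lceil \log_\psi k \rceil$ gives the statement of the theorem. □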
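As an informal sanity check of the order of magnitude in Theorem 2, one can simulate the learner on binomially distributed positive examples. The sketch below is an illustration under stated assumptions, not part of the original analysis: it tracks only the $k$ irrelevant positions, fixes $p = 1/2$ (so $\psi = 2$), and uses the hypothetical helper names `conv_positive`, `ceil_log` and `bound` together with an arbitrary seed.

```python
import random


def conv_positive(k: int, p: float, rng: random.Random) -> int:
    """Number of positive examples read until all k irrelevant literals are gone.

    Only irrelevant positions are simulated: relevant bits are identical
    in every positive example and never cause a deletion after b_1.
    """
    first = [rng.random() < p for _ in range(k)]  # irrelevant bits of b_1
    # After b_1 each irrelevant position still carries one literal; it is
    # deleted as soon as a later example differs from b_1 at that position.
    alive = set(range(k))
    reads = 1
    while alive:
        reads += 1
        alive = {j for j in alive if (rng.random() < p) == first[j]}
    return reads


def ceil_log(base: float, x: float) -> int:
    """Smallest integer t >= 0 with base**t >= x (assumes base > 1)."""
    t, power = 0, 1.0
    while power < x:
        power *= base
        t += 1
    return t


def bound(k: int, p: float) -> int:
    """The quantity ceil(log_psi k) + 3 with psi = min(1/(1-p), 1/p)."""
    psi = min(1.0 / (1.0 - p), 1.0 / p)
    return ceil_log(psi, k) + 3


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
trials = 4000
mean = sum(conv_positive(16, 0.5, rng) for _ in range(trials)) / trials
print(f"empirical mean of CONV: {mean:.2f}, bound: {bound(16, 0.5)}")
```

For $k = 16$ and $p = 1/2$ the bound evaluates to $\lceil \log_2 16 \rceil + 3 = 7$. An empirical mean is of course no substitute for the proof; the simulation merely makes the quantities in the theorem concrete.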

Now, we can easily determine the expected total learning time. 



Corollary 1. For every binomially distributed text with parameter $0 < p < 1$
the average total learning time of algorithm $\mathcal{P}$ for concepts in $\mathcal{C}_n$ with $\mu$ literals
is at most $O(n \log(n - \mu + 2))$.

The Wholist algorithm possesses the two favorable properties needed to apply
Theorem 1, i.e., it is rearrangement-independent and conservative. Thus, we can
conclude

$$\Pr[CONV > 2t \cdot E[CONV]] \le 2^{-t} \quad \text{for all } t \in \mathbb{N} .$$

Next, we turn our attention to the design of a stochastic finite learner. We
study the case that the positive examples are binomially distributed with parameter
$p$. But we do not require precise knowledge about the underlying distribution.
Instead, we reasonably assume that prior knowledge is provided by parameters
$p_{low}$ and $p_{up}$ such that $p_{low} \le p \le p_{up}$ for the true parameter $p$. Binomial
distributions fulfilling this requirement are called $(p_{low}, p_{up})$-admissible
distributions. Let $\mathcal{D}_n[p_{low}, p_{up}]$ denote the set of such distributions on $\mathcal{X}_n$.

If bounds $p_{low}$ and $p_{up}$ are available, the Wholist algorithm can be transformed
into a stochastic finite learner inferring all concepts with high confidence.

Theorem 3. Let $0 < p_{low} \le p_{up} < 1$ and $\psi := \min\{\frac{1}{1 - p_{low}}, \frac{1}{p_{up}}\}$. Then
$(\mathcal{C}_n, \mathcal{D}_n[p_{low}, p_{up}])$ is stochastically finitely learnable with high confidence from
positive presentations. To achieve $\delta$-confidence no more than $O(\log_2 1/\delta \cdot \log_\psi n)$
many examples are necessary.

Proof. The learner is based on the Wholist algorithm and a counter for the
number of examples already processed. If the Wholist algorithm is run for
$\tau := \lceil \log_\psi n \rceil + 3$ many examples, Theorem 2 implies that $\tau$ is at least as large as
the expected convergence stage $E[CONV]$.

In order to achieve the desired confidence, the learner sets $\gamma := \lceil \log \frac{1}{\delta} \rceil$
and runs the Wholist algorithm for a total of $2\gamma \cdot \tau$ examples. The algorithm
outputs the last hypothesis $h_{2\gamma\tau}$ produced by the Wholist algorithm and stops
thereafter. The reliability follows from the tail bounds established in Theorem 1.

□
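The number of examples read by the learner in this proof is easy to compute from the prior knowledge. The following Python sketch is only an illustration: the function names are hypothetical, and the logarithm in $\gamma$ is assumed to be taken to base 2, in line with the $\log_2 1/\delta$ factor in Theorem 3.

```python
def ceil_log(base: float, x: float) -> int:
    """Smallest integer t >= 0 with base**t >= x (assumes base > 1)."""
    t, power = 0, 1.0
    while power < x:
        power *= base
        t += 1
    return t


def examples_needed(n: int, p_low: float, p_up: float, delta: float) -> int:
    """Number 2 * gamma * tau of examples read before the learner stops."""
    psi = min(1.0 / (1.0 - p_low), 1.0 / p_up)
    tau = ceil_log(psi, n) + 3          # tau >= E[CONV] by Theorem 2
    gamma = ceil_log(2.0, 1.0 / delta)  # assumption: logarithm to base 2
    return 2 * gamma * tau


# Example: n = 16 variables, p known to be exactly 1/2, confidence 1 - 1/4.
print(examples_needed(16, 0.5, 0.5, 0.25))
```

With $n = 16$, $p_{low} = p_{up} = 1/2$ and $\delta = 1/4$ this gives $\tau = 7$, $\gamma = 2$ and hence 28 examples. Weaker prior knowledge, i.e., a wider interval $[p_{low}, p_{up}]$, decreases $\psi$ and thereby increases $\tau$.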



4.1 Average-Case Analysis for Learning 
in the Limit from Informant 

Finally, let us consider how the results obtained so far translate to the case of
learning from informant. Since the Wholist algorithm does not learn anything
from negative examples, one may expect that it behaves much worse in this
setting. Let us first investigate the uniform distribution over $\mathcal{X}_n$. Again, we
have the trivial cases that the target concept is "FALSE" or $m$ is a monomial
without irrelevant bits. In the first case, no example is needed at all, while in the
latter one, there is only one positive example having probability $2^{-n}$. Thus the
expected number of examples needed until successful learning is $2^n = 2^{\#(m)}$.

Theorem 4. Let $c = L(m) \in \mathcal{C}_n$ be a nontrivial concept. If a data sequence
for $c$ is generated from the uniform distribution on the learning domain by
independent draws, the expected number of examples needed by algorithm $\mathcal{P}$ until
convergence is bounded by

$$E[CONV] \le 2^{\#(m)} \left(\lceil \log_2 k(m) \rceil + 3\right) .$$

Proof. Let $CONV^+$ be a random variable for the number of positive examples
needed until convergence. Every positive example is preceded by a possibly
empty block of negative examples. Thus, we can partition the initial segment
of any randomly drawn informant read until convergence into $CONV^+$ many
blocks $B_j$ containing a certain number of negative examples followed by
precisely one positive example. Let $\Lambda_j$ be a random variable for the length of block
$B_j$. Then $CONV = \Lambda_1 + \Lambda_2 + \cdots + \Lambda_{CONV^+}$, where the $\Lambda_j$ are independently
identically distributed. In order to compute the distribution of $\Lambda_j$, it suffices
to calculate the probabilities to draw a negative and a positive example,
respectively. Since the overall number of positive examples for $c$ is $2^k$ with $k = k(m)$,
the probability to generate a positive example is $2^{k-n}$. Hence, the probability
to draw a negative example is $1 - 2^{k-n}$. Consequently,

$$\Pr[\Lambda_j = \mu + 1] = \left(1 - 2^{k-n}\right)^{\mu} \cdot 2^{k-n} .$$

Therefore,

$$\begin{aligned}
E[CONV] &= E[\Lambda_1 + \Lambda_2 + \cdots + \Lambda_{CONV^+}] \\
 &= \sum_{\zeta=0}^{\infty} E[\Lambda_1 + \Lambda_2 + \cdots + \Lambda_{\zeta} \mid CONV^+ = \zeta] \cdot \Pr[CONV^+ = \zeta] \\
 &= \sum_{\zeta=0}^{\infty} \zeta \cdot E[\Lambda_1] \cdot \Pr[CONV^+ = \zeta] \\
 &= E[CONV^+] \cdot E[\Lambda_1] .
\end{aligned}$$

By Theorem 2 with $p = 1/2$, and hence $\psi = 2$, we have $E[CONV^+] \le \lceil \log_2 k \rceil + 3$, and thus it remains
to estimate $E[\Lambda_1]$. A simple calculation shows

Lemma 1. For every $0 < a < 1$ holds:

$$\sum_{\mu=0}^{\infty} (\mu + 1) \cdot a^{\mu} = \frac{1}{(1 - a)^2} .$$

Using this estimation we can conclude

$$E[\Lambda_1] = \sum_{\mu=0}^{\infty} (\mu + 1) \cdot \Pr[\Lambda_1 = \mu + 1]
 = 2^{k-n} \sum_{\mu=0}^{\infty} (\mu + 1) \cdot \left(1 - 2^{k-n}\right)^{\mu} = 2^{n-k} ,$$

and thus the theorem follows. □
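The use of Lemma 1 in the last step can be checked numerically by truncating the series. The short Python sketch below is purely illustrative; the function names and the truncation length are assumptions.

```python
def partial_sum(a: float, terms: int) -> float:
    """Partial sum of the series sum_{mu >= 0} (mu + 1) * a**mu."""
    return sum((mu + 1) * a**mu for mu in range(terms))


def expected_block_length(n: int, k: int, terms: int = 10_000) -> float:
    """Truncated series for E[Lambda_1] under the uniform distribution."""
    pos = 2.0 ** (k - n)  # probability of drawing a positive example
    return pos * partial_sum(1.0 - pos, terms)


# Lemma 1 with a = 1/2 predicts 1 / (1 - 1/2)^2 = 4.
print(partial_sum(0.5, 200))
# n = 6 variables, k = 3 irrelevant bits: the proof predicts 2^(n-k) = 8.
print(expected_block_length(6, 3))
```

With $a = 1 - 2^{k-n}$ the closed form $1/(1-a)^2 = 2^{2(n-k)}$, multiplied by the leading factor $2^{k-n}$, gives the value $2^{n-k} = 2^{\#(m)}$ used in Theorem 4.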

Hence, as long as the length of $m$ is constant, and therefore $k(m) = n - O(1)$,
we still achieve an expected total learning time of order $n \log n$. But if $\#(m)$
grows linearly the expected total learning time becomes exponential. On the other
hand, if there are many relevant literals then even $h_0$ may be considered as
a not too bad approximation for $c$. Consequently, it is natural at this point
to introduce an error parameter $\varepsilon \in (0, 1)$ as in the PAC model, and to ask
whether one can achieve an expected sample complexity for computing an
$\varepsilon$-approximation that is bounded by a function depending on $\log n$ and $1/\varepsilon$.

To answer this question, let us formally define $error_m(h_j) = D(L(h_j) \,\triangle\, L(m))$
to be the error made by hypothesis $h_j$ with respect to monomial $m$. Here
$L(h_j) \,\triangle\, L(m)$ stands for the symmetric difference of $L(h_j)$ and $L(m)$ and $D$ for
the underlying probability distribution with respect to which the examples are
drawn. Note that by Theorem 1 we can conclude $error_m(h_j) = D(L(m) \setminus L(h_j))$.
We call $h_j$ an $\varepsilon$-approximation for $m$ if $error_m(h_j) \le \varepsilon$. Finally, we redefine
the stage of convergence. Let $d = (d_j)_{j \in \mathbb{N}^+}$ be an informant for $L(m)$, then

$CONV_\varepsilon(d) =_{df}$
the least number $j$ such that $error_m(\mathcal{P}(d[i])) \le \varepsilon$ for all $i \ge j$.

Note that once the Wholist algorithm has reached an $\varepsilon$-approximate hypothesis
all further hypotheses will also be at least that close to the target monomial.

The following theorem gives an affirmative answer to the question posed 
above. 

Theorem 5. Let $c = L(m) \in \mathcal{C}_n$ be a nontrivial concept. Assuming that examples
are drawn independently from the uniform distribution, the expected number
of examples needed by algorithm $\mathcal{P}$ until converging to an $\varepsilon$-approximation for
$c$ can be bounded by

$$E[CONV_\varepsilon] \le \frac{1}{\varepsilon} \cdot \left(\lceil \log_2 k(m) \rceil + 3\right) .$$

Proof. It holds $error_m(h_0) = 2^{k(m) - n}$, since $h_0$ misclassifies exactly the positive
examples. Therefore, if $error_m(h_0) \le \varepsilon$, we are already done. Now suppose
$error_m(h_0) > \varepsilon$. Consequently, $1/\varepsilon > 2^{n - k(m)}$, and thus the bound stated in
the theorem is larger than $2^{n - k(m)} (\lceil \log_2 k(m) \rceil + 3)$, which, by Theorem 4, bounds
the expected number of examples needed until convergence to a correct hypothesis. □

Thus, additional knowledge concerning the underlying probability distribution
pays off again. Applying Theorem 1 and modifying the stochastic finite
learner presented above mutatis mutandis, we get a learner identifying
$\varepsilon$-approximations for all concepts in $\mathcal{C}_n$ stochastically with high confidence using
$O(\frac{1}{\varepsilon} \cdot \log \frac{1}{\delta} \cdot \log n)$ many examples. Comparing this bound with the sample
complexity given in the PAC model, we see that it is reduced exponentially, i.e.,
instead of a factor $n$ now we have the factor $\log n$.

Finally, we can generalize the last results to the case that the data sequences
are binomially distributed for some parameter $p \in (0, 1)$. This means that
any particular vector containing $\nu$ times a 1 and $n - \nu$ times a 0 has probability
$p^{\nu} (1 - p)^{n - \nu}$ since a 1 is drawn with probability $p$ and a 0 with probability
$1 - p$. First, Theorem 4 generalizes as follows.

Theorem 6. Let $c = L(m) \in \mathcal{C}_n$ be a nontrivial concept. Let $m$ contain precisely
$\pi$ positive literals and $\tau$ negative literals. If the labeled examples for $c$ are
independently binomially distributed with parameter $p$ and $\psi := \min\{\frac{1}{1-p}, \frac{1}{p}\}$,
then the expected number of examples needed by algorithm $\mathcal{P}$ until convergence
can be bounded by

$$E[CONV] \le \frac{1}{p^{\pi} (1 - p)^{\tau}} \left(\lceil \log_\psi k(m) \rceil + 3\right) .$$

Proof. Assuming the same notation as in the proof of Theorem 4, it is easy to
see that we only have to recompute $E[\Lambda_1]$, and thus $\Pr[\Lambda_1 = \mu + 1]$, too. Since
$m$ contains precisely $\pi$ positive literals and $\tau$ negative literals, the probability
to draw a positive example is clearly $p^{\pi} (1 - p)^{\tau}$, and thus the probability to
randomly draw a negative example is $1 - p^{\pi} (1 - p)^{\tau}$. Consequently,

$$\Pr[\Lambda_1 = \mu + 1] = \left(1 - p^{\pi} (1 - p)^{\tau}\right)^{\mu} p^{\pi} (1 - p)^{\tau} ,$$

and Lemma 1 gives $E[\Lambda_1] = \frac{1}{p^{\pi} (1 - p)^{\tau}}$. □

Theorem 5 directly translates into the setting of binomially distributed inputs,
too.

Theorem 7. Let $c = L(m) \in \mathcal{C}_n$ be a nontrivial concept. Assume that the
examples are drawn with respect to a binomial distribution with parameter $p$,
and let $\psi = \min\{\frac{1}{1-p}, \frac{1}{p}\}$. Then the expected number of examples needed by
algorithm $\mathcal{P}$ until converging to an $\varepsilon$-approximation for $c$ can be bounded by

$$E[CONV_\varepsilon] \le \frac{1}{\varepsilon} \cdot \left(\lceil \log_\psi k(m) \rceil + 3\right) .$$

Finally, one can also get stochastic finite approximations with high confidence
from informant with an exponentially smaller sample complexity.

Theorem 8. Let $0 < p_{low} \le p_{up} < 1$ and $\psi := \min\{\frac{1}{1 - p_{low}}, \frac{1}{p_{up}}\}$. For
$(\mathcal{C}_n, \mathcal{D}_n[p_{low}, p_{up}])$, $\varepsilon$-approximations are stochastically finitely learnable with
$\delta$-confidence from informant for any $\varepsilon, \delta \in (0, 1)$.

For this purpose, $O(\frac{1}{\varepsilon} \cdot \log_2 1/\delta \cdot \log_\psi n)$ many examples suffice.

5 Learning the Pattern Languages 

Finally, we take a look at the class of all pattern languages. The results presented 
below have been obtained by Rossmanith and Zeugmann [19] and we refer the 
reader to this paper for all omitted details. 

This class is not PAC learnable with respect to every hypothesis space having 
a membership problem uniformly decidable in polynomial time. 

Following Angluin [1] we define patterns and pattern languages as follows.
Let $A = \{0, 1, \ldots\}$ be any non-empty finite alphabet containing at least two
elements. By $A^*$ we denote the free monoid over $A$. The set of all finite non-null
strings of symbols from $A$ is denoted by $A^+$, i.e., $A^+ = A^* \setminus \{\lambda\}$, where
$\lambda$ denotes the empty string. Let $X = \{x_i \mid i \in \mathbb{N}\}$ be an infinite set of variables
such that $A \cap X = \emptyset$. Patterns are non-empty strings over $A \cup X$, e.g.,
$01$, $0x_0111$, $1x_0x_00x_1x_2x_0$ are patterns. The length of a string $s \in A^*$ and of
a pattern $\pi$ is denoted by $|s|$ and $|\pi|$, respectively. A pattern $\pi$ is in canonical
form provided that if $k$ is the number of different variables in $\pi$ then the
variables occurring in $\pi$ are precisely $x_0, \ldots, x_{k-1}$. Moreover, for every $j$ with
$0 \le j < k - 1$, the leftmost occurrence of $x_j$ in $\pi$ is left to the leftmost occurrence
of $x_{j+1}$. The examples given above are patterns in canonical form. In the
sequel we assume, without loss of generality, that all patterns are in canonical
form. By $Pat$ we denote the set of all patterns in canonical form.

If $k$ is the number of different variables in $\pi$ then we refer to $\pi$ as to
a $k$-variable pattern. By $Pat_k$ we denote the set of all $k$-variable patterns.
Furthermore, let $\pi \in Pat_k$, and let $u_0, \ldots, u_{k-1} \in A^+$; then we denote by
$\pi[x_0/u_0, \ldots, x_{k-1}/u_{k-1}]$ the string $w \in A^+$ obtained by substituting $u_j$ for
each occurrence of $x_j$, $j = 0, \ldots, k - 1$, in the pattern $\pi$. For example, let
$\pi = 0x_01x_1x_0$. Then $\pi[x_0/10, x_1/01] = 01010110$. The tuple $(u_0, \ldots, u_{k-1})$ is
called a substitution. Furthermore, if $|u_0| = \cdots = |u_{k-1}| = 1$, then we refer to
$(u_0, \ldots, u_{k-1})$ as to a shortest substitution. Let $\pi \in Pat_k$; we define the language
generated by pattern $\pi$ by $L(\pi) = \{\pi[x_0/u_0, \ldots, x_{k-1}/u_{k-1}] \mid u_0, \ldots, u_{k-1} \in A^+\}$.
By $PAT_k$ we denote the set of all $k$-variable pattern languages. Finally,
$PAT = \bigcup_{k \in \mathbb{N}} PAT_k$ denotes the set of all pattern languages over $A$.
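The definitions of substitution and canonical form can be made concrete with a few lines of code. The sketch below is an illustration only; representing a pattern as a Python list in which a string is a constant from $A$ and an integer $j$ stands for the variable $x_j$ is an assumption of this example, as are the function names.

```python
from typing import List, Sequence, Union

# Assumed encoding: a str is a constant symbol, an int j is the variable x_j.
Pattern = List[Union[str, int]]


def substitute(pattern: Pattern, subst: Sequence[str]) -> str:
    """Return pi[x_0/u_0, ..., x_{k-1}/u_{k-1}] for subst = (u_0, ..., u_{k-1})."""
    if any(u == "" for u in subst):
        raise ValueError("substitution strings must be non-empty")
    return "".join(s if isinstance(s, str) else subst[s] for s in pattern)


def is_canonical(pattern: Pattern) -> bool:
    """Check that the variables first occur in the order x_0, x_1, ..."""
    next_var = 0
    for s in pattern:
        if isinstance(s, int):
            if s > next_var:       # a variable index was skipped
                return False
            if s == next_var:      # first occurrence of the next variable
                next_var += 1
    return True


pi = ["0", 0, "1", 1, 0]           # the pattern 0 x_0 1 x_1 x_0
print(substitute(pi, ["10", "01"]))
print(is_canonical(pi))
```

Applied to the example from the text, `substitute(pi, ["10", "01"])` yields the string 01010110, and all three sample patterns given above pass `is_canonical`.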

Our pattern learner will directly output patterns as hypotheses. Note that
this hypothesis space is not admissible for PAC learning unless $P = NP$. It uses
Lange and Wiehagen's [11] pattern language learner as a main ingredient. We
consider here learning from positive data only. Again, first an average-case analysis
has to be performed. Every string of a particular pattern language is generated
by at least one substitution. Therefore, it is convenient to consider probability
distributions over the set of all possible substitutions. That is, if $\pi \in Pat_k$, then
it suffices to consider any probability distribution $D$ over $\underbrace{A^+ \times \cdots \times A^+}_{k \text{ times}}$. For
$(u_0, \ldots, u_{k-1}) \in A^+ \times \cdots \times A^+$ we denote by $D(u_0, \ldots, u_{k-1})$ the probability
that variable $x_0$ is substituted by $u_0$, variable $x_1$ is substituted by $u_1$, ...,
and variable $x_{k-1}$ is substituted by $u_{k-1}$.

In particular, we mainly consider a special class of distributions, i.e., product
distributions. Let $k \in \mathbb{N}^+$, then the class of all product distributions for
$Pat_k$ is defined as follows. For each variable $x_j$, $0 \le j \le k - 1$, we assume an
arbitrary probability distribution $D_j$ over $A^+$ on substitution strings. Then
we call $D = D_0 \times \cdots \times D_{k-1}$ product distribution over $A^+ \times \cdots \times A^+$, i.e.,
$D(u_0, \ldots, u_{k-1}) = \prod_{j=0}^{k-1} D_j(u_j)$. Moreover, we call a product distribution regular
if $D_0 = \cdots = D_{k-1}$. Throughout this paper, we restrict ourselves to deal
with regular distributions. We therefore use $d$ to denote the distribution over
$A^+$ on substitution strings, i.e., $D(u_0, \ldots, u_{k-1}) = \prod_{j=0}^{k-1} d(u_j)$. We call a regular
distribution admissible if $d(a) > 0$ for at least two different elements $a \in A$.

We will express all estimates with the help of the following parameters: $E[\Lambda]$,
$\alpha$ and $\beta$, where $\Lambda$ is a random variable for the length of the examples drawn; $\alpha$
and $\beta$ are defined below. To get concrete bounds for a concrete implementation
one has to obtain $c$ from the algorithm and has to compute $E[\Lambda]$, $\alpha$, and $\beta$
from the admissible probability distribution $D$. Let $u_0, \ldots, u_{k-1}$ be independent
random variables with distribution $d$ for substitution strings. Whenever
the index $i$ of $u_i$ does not matter, we simply write $u$ or $u'$.

The two parameters $\alpha$ and $\beta$ are now defined via $d$. First, $\alpha$ is simply the
probability that $u$ has length 1, i.e.,

$$\alpha = \Pr(|u| = 1) = \sum_{a \in A} d(a) .$$

Second, $\beta$ is the conditional probability that two random strings that get
substituted into $\pi$ are identical under the condition that both have length 1, i.e.,

$$\beta = \Pr(u = u' \mid |u| = |u'| = 1) = \frac{\sum_{a \in A} d(a)^2}{\left(\sum_{a \in A} d(a)\right)^2} .$$
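Both parameters are straightforward to compute once $d$ is given explicitly. The following Python sketch assumes, for illustration only, that $d$ is available as a dictionary mapping substitution strings to probabilities; the function name `alpha_beta` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def alpha_beta(d: Dict[str, float], alphabet: Iterable[str]) -> Tuple[float, float]:
    """Compute alpha = Pr(|u| = 1) and beta = Pr(u = u' | |u| = |u'| = 1)."""
    singles = [d.get(a, 0.0) for a in alphabet]  # d(a) for every a in A
    alpha = sum(singles)
    beta = sum(x * x for x in singles) / (alpha * alpha)
    return alpha, beta


# Example distribution over A^+ with A = {0, 1}: half of the probability
# mass lies on substitution strings of length 1.
d = {"0": 0.25, "1": 0.25, "00": 0.5}
print(alpha_beta(d, "01"))
```

For this example distribution, $\alpha = 1/2$ and $\beta = (1/16 + 1/16)/(1/4) = 1/2$. Admissibility, i.e., $d(a) > 0$ for at least two symbols $a \in A$, guarantees $\alpha > 0$ and $\beta < 1$.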

Then we can prove the following bound for the stage of convergence for every
target pattern from $Pat_k$.

Theorem 9. $E[CONV] = O\left(\frac{1}{\alpha^k} \cdot \log_{1/\beta}(k)\right)$ for all $k \ge 2$.

The expected total learning time can be estimated by $O\left(\frac{1}{\alpha^k} E[\Lambda] \log_{1/\beta}(k)\right)$
for all $k \ge 2$.

The transformation of Lange and Wiehagen's [11] pattern language learner
into a stochastic finite learner is then performed mutatis mutandis. The prior
knowledge needed is now assumed concerning a lower bound for $\alpha$ and an upper
bound for $\beta$. More precisely, we have the following theorem.

Theorem 10. Let $\alpha_*, \beta_* \in (0, 1)$. Assume $\mathcal{D}$ to be a class of admissible
probability distributions over $A^+$ such that $\alpha \ge \alpha_*$, $\beta \le \beta_*$ and $E[\Lambda]$ finite
for all distributions $D \in \mathcal{D}$. Then $(PAT, \mathcal{D})$ is stochastically finitely learnable
with high confidence from text.

The latter theorem allows a nice corollary which we state next. Making the 
same assumption as done by Kearns and Pitt [10], i.e., assuming the additional 
prior knowledge that the target pattern belongs to Patk , the complexity of the 
stochastic finite learner given above can be considerably improved. The resulting 
learning time is linear in the expected string length, and the constant depending 
on k grows only exponentially in k in contrast to the doubly exponentially 
growing constant in Kearns and Pitt’s [10] algorithm. Moreover, in contrast 
to their learner, our algorithm learns from positive data only, and outputs a 
hypothesis that is correct for the target language with high probability. 

Again, for the sake of presentation we shall assume $k \ge 2$. Moreover, if the
prior knowledge $k = 1$ is available, then there is also a much better stochastic
finite learner for $PAT_1$ (cf. [18]).

Corollary 2. Let $\alpha_*, \beta_* \in (0, 1)$. Assume $\mathcal{D}$ to be a class of admissible
probability distributions over $A^+$ such that $\alpha \ge \alpha_*$, $\beta \le \beta_*$ and $E[\Lambda]$ finite for
all distributions $D \in \mathcal{D}$. Furthermore, let $k \ge 2$ be arbitrarily fixed. Then there
exists a learner $M$ such that

(1) $M$ learns $(PAT_k, \mathcal{D})$ stochastically finitely with high confidence from text,
and

(2) The running time of $M$ is $O\left(\alpha_*^{-k}\, E[\Lambda] \log_{1/\beta_*}(k) \log_2(1/\delta)\right)$.

(* Note that $\alpha_*^{-k}$ and $\log_{1/\beta_*}(k)$ now are constants. *)

Further results have been obtained for the class of all one-variable pattern 
languages. Note that even this class is not PAC learnable as shown in [14]. For 
the class of all one-variable pattern languages our stochastic finite learner has a 
total learning time that is essentially linear in the length of the target pattern 
(cf. [16,18]). Moreover, this stochastic finite learner works for all meaningful 
probability distributions. 

These results provide at least some evidence that our new approach may lead 
to a new generation of learning algorithms that combine a feasible total learning 
time and proven performance guarantees.

Abstract. Sequential sampling algorithms have recently attracted
interest as a way to design scalable algorithms for data mining and
KDD processes. In this paper, we identify an elementary sequential 
sampling task (estimation from examples), from which one can derive 
many other tasks appearing in practice. We present a generic algorithm 
to solve this task and an analysis of its correctness and running time 
that is simpler and more intuitive than those existing in the literature. 
For two specific tasks, frequency and advantage estimation, we derive 
lower bounds on running time in addition to the general upper bounds. 

Keywords: Random sampling, sequential sampling, adaptive sampling,
Chernoff bounds, data mining.



1 Introduction 

It is a universal phenomenon that data repositories in all domains are growing 
at an exponential rate. In many applications, the amount of available input 
data is already so large that not all data can be processed, maybe not even read, 
within reasonable time. Random sampling is one possible strategy that algorithm 
designers can use to deal with massive amounts of data.

Typically, a sampling algorithm obtains a random sample from a large 
database, performs some computation on the sample, and hopes that the re- 
sult is not too different from what would be obtained by computing over the 
whole database. Of course, for some computation tasks this approach fails com- 
pletely; either the result is required to be exact, or there is no way of obtaining 
even an approximate answer without looking at all data. Often, however, an 
approximate answer is acceptable, and can be obtained with some confidence by 
looking at samples of reasonable size.

To make the discussion concrete, we assume from now on that the task is that
of estimating some quantity $\mu$ implicit in the database. As the simplest example,
$\mu$ could be the fraction of database records having a certain property. Then
our task is to find some approximation $\hat{\mu}$ of $\mu$ up to $\varepsilon$ that is correct with
probability at least $1 - \delta$, for given parameters $\varepsilon$ and $\delta$. Depending on the
application, we may require absolute approximation, that is, $\mu - \varepsilon \le \hat{\mu} \le \mu + \varepsilon$, or
relative one, that is, $(1 - \varepsilon)\mu \le \hat{\mu} \le (1 + \varepsilon)\mu$. We call this type of approximation
(especially, the relative one) an $(\varepsilon, \delta)$-approximation of $\mu$.

In these situations, large deviation bounds (such as Chernoff or Hoeffding 
type bounds, or those given by the Central Limit Theorem) may be used to 
determine a sample size that guarantees such approximations. These bounds 
have often been used, more or less rigorously, to support sampling strategies in 
different domains, such as machine learning and data mining [3,6,7,12,17,18]. 
In some scenarios, remarkable speedup has been reported, sometimes by orders 
of magnitude. In other works, however, researchers have considered sampling 
either impractical, or difficult to use. There are, in our opinion, two causes for 
the negative conclusions: 

(1) Hoeffding (additive) type bounds are worst case, in the sense that they ap- 
ply to all parameter settings. Generally this leads to prohibitive sample size. 
Furthermore, Hoeffding bounds can only be applied for absolute, not relative 
approximation. 

(2) Chernoff (multiplicative) type bounds provide relative approximations and
smaller bounds, but cannot be applied to our problem because they depend
on the unknown quantity to be estimated. To make this point clear, suppose
that $\mu$ is simply the frequency of some event in the database. The Chernoff bound
states that sample size $O((\varepsilon^2 \mu)^{-1} \cdot \ln(\delta^{-1}))$ suffices for obtaining a relative
$(\varepsilon, \delta)$-approximation of $\mu$. Since $\mu$ is the unknown, it is not clear how to
apply this formula to determine sample size.
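The contrast between the two kinds of bounds can be illustrated by computing the batch sample sizes they prescribe. The Python sketch below uses the standard textbook forms of the two-sided Hoeffding bound, $2\exp(-2n\varepsilon^2)$, and of the multiplicative Chernoff bound, $2\exp(-\varepsilon^2 \mu n / 3)$ for $0 < \varepsilon \le 1$; these constants are assumptions of this example and are not taken from the text, which states the bounds only up to $O(\cdot)$.

```python
import math


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Samples sufficient for absolute error eps with probability >= 1 - delta.

    Assumes i.i.d. variables in [0, 1] and the bound 2 * exp(-2 * n * eps^2).
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


def chernoff_sample_size(eps: float, delta: float, mu: float) -> int:
    """Samples sufficient for relative error eps with probability >= 1 - delta.

    Assumes the bound 2 * exp(-eps^2 * mu * n / 3), valid for 0 < eps <= 1.
    Note that the unknown mu itself appears as an argument.
    """
    return math.ceil(3.0 * math.log(2.0 / delta) / (eps * eps * mu))


print(hoeffding_sample_size(0.1, 0.05))       # independent of mu
print(chernoff_sample_size(0.1, 0.05, 0.5))   # requires a value for mu
print(chernoff_sample_size(0.1, 0.05, 0.01))  # grows as mu shrinks
```

The second function cannot be called without supplying $\mu$, which is exactly the difficulty described in item (2): a batch sampler would have to guess a lower bound on $\mu$ in advance, whereas a sequential sampler can adapt as evidence accumulates.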

Sequential sampling algorithms try to overcome these difficulties in the fol- 
lowing way. Instead of obtaining the whole sample from the start, they collect 
database records sequentially; from time to time (in the extreme case, at every 
step) they decide whether they have already seen enough records to reach a safe 
conclusion, and if so, they stop. Since, in practice, we are not often in the worst- 
case situation, these algorithms usually stop much before their worst-case bound. 
They have also been called adaptive [4,5] because their performance adapts to 
the complexity of the particular problem instance being solved. 

Several researchers in the machine learning and the KDD communities have 
used heuristics based on stopping sampling when some conclusion seems very 
likely [12,17,18]. On the other hand, statisticians have worked extensively on 
sequential sampling since the 40s (Wald’s work is classical [22]; see, e.g., [9] for 
recent works). But they focused mostly on minimizing the number of experiments 
for hypothesis testing. 

Only very recently, some sequential sampling algorithms have been developed
that (i) can be applied to current, general, KDD tasks and that (ii) have theoretical
guarantees of correctness and performance [4,5,19,20,21]. In these works, the
algorithms have been experimentally tested against those using non-sequential
sampling, or no sampling at all, and have systematically outperformed them. 
Typically, the performance of these algorithms depends on the quality of the 
data rather than the size of the database; so their advantage is expected to be 
more apparent as databases grow. 

We thus advocate applying sequential sampling algorithms in more
and more situations. This is hindered by at least two factors. First, the analyses
in the literature (e.g., [5,20,21]) are somewhat complicated and tailored to the particular
form of the task that is being solved, instead of reflecting clearly the intuition
of why adaptive sampling works; thus, when a new task has to be solved most
of the analysis has to be done again. Second, there are few theoretical tools 
to determine how good a given sequential sampling algorithm is for some task, 
that is, whether it is about as fast as possible or whether a large improvement 
is possible by more clever design; but see [2] for an exception. 

This paper contributes to these issues as follows: 

1. We isolate a basic sequential sampling task from which many others can be 
derived; we present a very generic sequential estimator and provide an analysis of 
the correctness and efficiency of this procedure much more intuitive and simple 
than those in the literature (e.g., [5,20,21]). The analysis is given in an almost 
“recipe” form for the use of algorithm designers. 

2. We present a couple of case studies (already discussed in the literature) 
showing the conciseness of analyses obtained this way. 

3. We show how this analysis technique may lead to lower bounds on the running
time of some sequential sampling techniques. In particular, we prove lower
bounds for the running time for $(\varepsilon, \delta)$-approximation in so-called frequency
estimation and advantage estimation problems; these lower bounds match (up to
doubly logarithmic factors) the upper bounds given by our implementations of the generic
estimator, which therefore are almost optimal.

2 Previous Work 

We review in this section some existing work on sequential sampling applied to 
Knowledge Discovery in databases. Of course, other fields of computer science 
have used sampling extensively, too; see, e.g., [2]. 

Example 1. Lipton et al. [14,15] provide algorithms for estimating the size of
queries to the database. Their core algorithm is a sequential sampling method
for estimating the proportion $p$ of records in a database that satisfy a certain
property. That is, the task is essentially estimating a probability $p$ that a certain
event occurs in a sequence of i.i.d. Bernoulli trials, a task we call frequency
estimation. The target $\mu$ is the probability $p$ in this frequency estimation. Their
algorithm achieves relative $(\varepsilon, \delta)$-approximation in time $O(1/(\varepsilon^2 \mu) \ln(1/\delta))$.

Example 2. In [4,5] we presented an algorithm for a task appearing often in
the context of supervised learning: advantage estimation for a Boolean classifier.
A Boolean classifier predicts a 0/1 value on any example. The advantage $\gamma$ of a
Boolean classifier is its classification accuracy on a set of examples minus 1/2,
i.e., how much better it predicts than random guessing. The target $\mu$ is this
advantage $\gamma$ (of a given classifier) in advantage estimation. The algorithm in [5]
achieves relative $(\varepsilon, \delta)$-approximation in time $O(s \cdot (\ln\ln s + \ln(1/\delta)))$, where
$s = 1/(\varepsilon^2 \gamma^2)$.

Example 3. In [5], the algorithm in [4] is generalized to approximating any
$\mu$ that can be obtained by applying a smooth function to the average of bounded
variables. The algorithm is applied to the hypothesis selection problem: selecting
a hypothesis from a set $H$ with utility $\mu$ that is at most within a $(1 + \varepsilon)$ factor
of the best one. Its running time is $O(s \cdot (\ln\ln s + \ln(|H|/\delta)))$, where $s$ is as
above.

Example 4. Scheffer and Wrobel [20,21] observed that similar techniques work
for any $\mu$ for which one can prove deviation bounds that tend to 0 as sample size
grows. They use it for the $n$-best hypotheses selection task, i.e., for given $\epsilon$ and
$\delta$, select $n$ hypotheses from $H$ such that none of the selected hypotheses is more
than (absolute) $\epsilon$ worse than any discarded hypothesis. The running time of
their algorithm is in $O([\ldots]\ln(1/\delta))$, but note that it achieves absolute rather
than relative approximation. Additionally, their algorithm includes Hoeffding
races [17] to discard early those hypotheses that are unlikely to be among the $n$ best.

3 Our Framework and a Generic Estimator Algorithm 

We describe in this section a basic task that we want to solve by means of 
sequential sampling. 

We want to estimate some unknown quantity $\mu$, of which we know only
that $0 < \mu < 1$. To do this, we have available an infinite sequence of examples
$X_1, X_2, \ldots$, drawn independently from example space $X$, and some function
$F : X^* \to \mathbb{R}$ such that, for any $t$, $\mathrm{E}[F(X_1, \ldots, X_t)] = \mu$.

An estimator algorithm $A$ performs as follows: At the start, it reads input
parameter $\delta$. Then at each (time-)step $t \ge 1$, it reads the value of $X_t$ and
produces a pair of real numbers $(\mu_t, \epsilon_t)$. Note that, strictly speaking, $A$ is not
an algorithm because it never stops.

We say that $A$ is an estimator for $\mu$ if the following condition holds for all
inputs $\delta$,

(a) with probability at least $1 - \delta$, it holds for all $t$ that

\[ (1-\epsilon_t)\mu \le \mu_t \le (1+\epsilon_t)\mu, \]

and

(b) the sequence of $\epsilon_t$ tends to 0 as $t$ grows.

For condition (a), we should be a bit careful about the order of quantification of $\delta$
and $t$. It requires that only with probability less than $\delta$ there may be some
step where the approximation is not good.

We are interested in efficient estimator algorithms, for which the quantity $\epsilon_t$
tends to 0 as quickly as possible. Quite naturally, an estimator algorithm for $\mu$
can be used to obtain an $(\epsilon, \delta)$-approximation to $\mu$ if both $\epsilon$ and $\delta$ are given:
It is enough to run the estimator with parameter $\delta$, monitor its output until $\epsilon_t$
goes below $\epsilon$, and then produce $\mu_t$ as the estimation.

In this paper, we consider one generic estimator algorithm, which is stated
in Figure 1. This algorithm is quite straightforward; it just computes $\mu_t$ by
using the function $F$ (that is given when specifying the problem) for the observed
set of examples $X_1, \ldots, X_t$. It is so natural that one may not be able to think
of any other algorithm for solving the type of estimation problem that we are
considering. Below we will show “a recipe” to prove the correctness and analyze
the running time of this algorithm, depending on the convergence properties of
the estimations for $\mu$. We consider only relative approximation, but, if necessary,
it is routine to develop another algorithm for absolute approximation. This would
correspond to the algorithms in [20,21].



procedure GenericEstimator
    input δ;
    t := 0; S_0 := ∅;
    while (true) do
        t := t + 1;
        S_t := S_{t-1} ∪ {X_t};
        compute μ_t := F(S_t);
        compute ε_t (by some formula obtained in the analysis);
        output (μ_t, ε_t);
    end-while

Fig. 1. The Generic Estimator Algorithm
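
The generic estimator can be sketched in Python as a generator. This is only an illustration under assumptions: the names `generic_estimator`, `eps_formula`, and `approximate` are hypothetical, and the formula for the error bound is passed in as a parameter because it depends on the analysis of each concrete problem.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def generic_estimator(
    examples: Iterable[float],
    F: Callable[[List[float]], float],
    eps_formula: Callable[[float, int, float], float],
    delta: float,
) -> Iterator[Tuple[float, float]]:
    """Yield the pair (mu_t, eps_t) after reading each example X_t."""
    seen: List[float] = []                 # S_t: the examples read so far
    for x in examples:                     # t := t + 1; read X_t
        seen.append(x)
        t = len(seen)
        mu_t = F(seen)                     # mu_t := F(S_t)
        eps_t = eps_formula(mu_t, t, delta)  # problem-specific bound (assumed given)
        yield mu_t, eps_t


def approximate(
    estimates: Iterable[Tuple[float, float]], eps: float
) -> Optional[Tuple[int, float]]:
    """Monitor the output and stop at the first step where eps_t drops below eps."""
    for t, (mu_t, eps_t) in enumerate(estimates, start=1):
        if eps_t < eps:
            return t, mu_t
    return None  # the (finite) stream ended before eps_t went below eps
```

With a placeholder bound such as `lambda mu, t, d: 1.0 / t` (chosen only to exercise the loop, not derived from any deviation bound), `approximate` returns the first step at which the reported error falls below the requested accuracy.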



3.1 Implementation and Analysis 

Here we explain how to implement our generic algorithm for solving a given 
estimation problem and how to prove its correctness / efficiency for that problem. 

Step 1. Review the literature on large deviation bounds and find a good bound
$B$ for which we can prove that, for all $\alpha$ and $t$,

\[ \Pr[\, (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \,] \ge 1 - B(\alpha, \mu, t, \ldots). \]

Let us just write $B(\alpha, \mu, t)$ even though $B$ may depend on other parameters
(such as the variance of the underlying variables, or even $S_t$ itself).

Step 2. Observe that for the sequence $\alpha_t$ defined by

\[ B(\alpha_t, \mu, t) = \frac{\delta}{t(t+1)}, \]

the total probability that there is an error at some $t$ is at most $\delta$. This is because

\[ \Pr[\, \exists t : \neg[\, (1-\alpha_t)\mu \le \mu_t \le (1+\alpha_t)\mu \,] \,] \le \sum_{t \ge 1} B(\alpha_t, \mu, t) = \sum_{t \ge 1} \frac{\delta}{t(t+1)} = \delta. \]

Unfortunately, this definition of $\alpha_t$ depends on the unknown $\mu$, so we cannot simply
define $\epsilon_t = \alpha_t$ in the algorithm. But we can use some kind of “bootstrapping”
approach: if we assume that $\mu_t$ is quite near $\mu$, plugging it into the equation
instead of $\mu$ should give good enough results.

Step 3. Assume that statement (1) holds:

\[ \forall t : (1-\alpha_t)\mu \le \mu_t \le (1+\alpha_t)\mu. \qquad (1) \]

Assuming that $B$ is decreasing in $\mu$, we have

\[ B(\alpha_t, \mu_t/(1+\alpha_t), t) \ge B(\alpha_t, \mu, t) = \frac{\delta}{t(t+1)}, \]

and, intuitively, this inequality should be quite tight. Assuming that $B$ is also
decreasing in $\alpha$ (which is most often the case), solve this inequality for $\alpha_t$,
and obtain an upper bound for $\alpha_t$ in terms of $\mu_t$ and $t$ (and possibly other
parameters), which we denote by $B'(\mu_t, t)$. Define $\epsilon_t$ to be that upper bound;
that is,

\[ \alpha_t \le B'(\mu_t, t) = \epsilon_t. \qquad (2) \]

Step 4 (Conclusion). With probability at least $1 - \delta$, the bound (1) holds for
any $t \ge 1$. Then bound (2) also holds (for any $t \ge 1$), and therefore

\[ \forall t : (1-\epsilon_t)\mu \le \mu_t \le (1+\epsilon_t)\mu. \qquad (3) \]

Also, for reasonable bounds $B$, it should be immediate to check that the sequence
$\epsilon_t$ tends to 0. Then we have proved that the algorithm, with this definition
of $\epsilon_t$, is an estimator for $\mu$.



3.2 Efficiency 

For given $\epsilon$ and $\delta$, we can use our algorithm to obtain an $(\epsilon, \delta)$-approximation
of $\mu$. That is, we run the algorithm until $\epsilon_t$ becomes less than $\epsilon$. The following
procedure can be used to estimate the time used by the algorithm.

(1) We have a definition of $\epsilon_t$ using $\mu_t$ and $t$.

(2) We also have upper and lower bounds for $\mu_t$ in terms of $\mu$, $\epsilon_t$, and $t$, that
hold with probability at least $1 - \delta$.

(3) Solve for $\epsilon_t$ in these two equations. This gives bounds for $\epsilon_t$ in terms of $\mu$
and $t$ that hold with probability $1 - \delta$.

(4) Now demand that $\epsilon_t$ is at most $\epsilon$, and solve for $t$. This gives a bound on
the time $t$ required/sufficient to reach $\epsilon_t \le \epsilon$ with probability $1 - \delta$.

An algorithmic trick to improve this bound on convergence time was introduced
in [5]. The idea is to produce a new pair $(\mu_t, \epsilon_t)$ not at each iteration, but
update the estimation only at time steps $t$ of the form $t = \lfloor s^i \rfloor$, for some constant
$s > 1$. Equivalently, we may view this as follows: the algorithm is not taking
one example $X_i$ at each time step, but rather a batch of $s^i - s^{i-1}$ examples
at time $i$. This way, new errors can be introduced at exponentially increasing
time steps only, and it is enough to define the sequence $\alpha_t$ by a union bound over
the batch index $i$ rather than over $t$, e.g. $B(\alpha_t, \mu, t) = \delta/(i(i+1))$ for $t = \lfloor s^i \rfloor$.

This results in a new definition of $\epsilon_t$ and in a smaller running time. Typically,
this reduces a $\log t$ factor in the running time to some $\log\log t$ factor.
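
The batching idea can be illustrated with a short Python sketch. It lists the checkpoints $\lfloor s^i \rfloor$ up to a horizon together with a per-checkpoint error budget. The budget $\delta/(i(i+1))$ is an assumption made here for illustration (one natural way to keep the total error below $\delta$), not necessarily the exact choice of [5], and the function name `geometric_schedule` is hypothetical.

```python
import math
from typing import List, Tuple


def geometric_schedule(s: float, delta: float, t_max: int) -> List[Tuple[int, int, float]]:
    """Return (i, t_i, budget_i) for every checkpoint t_i = floor(s**i) <= t_max.

    budget_i = delta / (i * (i + 1)) is an assumed error budget for the
    i-th checkpoint; these budgets sum to less than delta over all i.
    """
    if s <= 1:
        raise ValueError("s must be greater than 1")
    schedule = []
    i = 1
    while math.floor(s ** i) <= t_max:
        schedule.append((i, math.floor(s ** i), delta / (i * (i + 1))))
        i += 1
    return schedule
```

Since the number of checkpoints up to time $t$ is about $\log_s t$, the union bound runs over logarithmically many events, which is where the saving from $\log t$ to $\log\log t$ comes from.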

In [5], algorithms using this trick are referred to as geometric algorithms, 
by comparison with arithmetic algorithms that check some stopping condition 
at every step, or after each constant number of steps as in [12]. In [19] sam- 
pling schedules for machine learning problems are considered. Independently, 
the terms arithmetic and geometric schedules are proposed there to distinguish 
between essentially the same strategies. Also, geometric schedules are proved to 
be optimal in some precise sense. 



4 Case Study (1): Frequency Estimation 

In the following two sections, we present a couple of instances of the estimator 
and analysis above. The interested reader may want to compare this analysis 
with the lengthier ones in [4,5,20,21,23]. 

First we consider frequency estimation, i.e., Example 1 mentioned in Section 2.
More specifically, we assume that we have access to a series of i.i.d.
variables $X_1, X_2, \ldots$, such that each $X_i \in \{0, 1\}$ takes value 1 with probability
$p$, and our goal is to estimate $p$. That is, our target $\mu$ is the probability
(or frequency) $p$ of $X_i$ being 1. Clearly, we have $\mu = \mathrm{E}[(X_1 + \cdots + X_t)/t]$.
Thus, the formula we would use for estimating $\mu$ is $\mu_t = (X_1 + \cdots + X_t)/t$. (In
other words, the function $F$ is defined by $F(X_1, \ldots, X_t) = (X_1 + \cdots + X_t)/t$
here.)

We show that, in this case, the generic estimator leads to a procedure for
$(\epsilon, \delta)$-approximation that runs in time $\tilde{O}((1/(\epsilon^2\mu)) \ln(1/\delta))$. (In this paper, we use $\tilde{O}(f)$
to denote $O(f \log f)$.) We also provide a similar lower bound. Recall that the
log factor can be reduced to log log by the trick discussed above, but we omit
its use here for simplicity.

Both for applying our basic estimator algorithm and for deriving a lower
bound, we use bounds provided by the Central Limit Theorem (see, e.g., [8]).
Here we recall some basic facts about the Central Limit Theorem. Consider any
sufficiently large $t$, and let $X_1, X_2, \ldots, X_t$ be a series of i.i.d. random variables
that take value either 1 or 0 with probability $p$ and $1 - p$ respectively. Define
$Y_t = \sum_{i=1}^{t} X_i$. Then the expectation and standard deviation of $Y_t$ are $\mathrm{E}[Y_t] = tp$ and
$\sigma_t = \sqrt{tp(1-p)}$ respectively. Now the Central Limit Theorem for sums of
Bernoulli variables implies the following approximation (if $t$ is large enough).

\[ \Pr\left[\, \left| \frac{Y_t - tp}{\sigma_t} \right| \le x \,\right] \approx 1 - 2(1 - \Phi(x)), \qquad (4) \]

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\, dz$. In our analysis, we will assume that this
approximation holds. (We remark that the Chernoff bound provides a very similar
upper bound, which is in fact provable for all $t$.)

We use the following upper and lower approximations of $1 - \Phi(x)$ [8].

\[ \frac{1}{\sqrt{2\pi}} \frac{\exp(-x^2/2)}{x} \left(1 - \frac{1}{x^2}\right) \le 1 - \Phi(x) \le \frac{1}{\sqrt{2\pi}} \frac{\exp(-x^2/2)}{x}. \qquad (5) \]

For $x > 1.1$, the $1 - 1/x^2$ factor can be replaced with 0.17.
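
The two-sided approximation of the normal tail can be checked numerically. The following Python sketch evaluates $1 - \Phi(x)$ through the complementary error function and compares it with the two bounds; the helper names `upper_tail` and `tail_bounds` are hypothetical.

```python
import math
from typing import Tuple


def upper_tail(x: float) -> float:
    """Return 1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_bounds(x: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on 1 - Phi(x), valid for x > 0."""
    base = math.exp(-x * x / 2.0) / (math.sqrt(2.0 * math.pi) * x)
    return base * (1.0 - 1.0 / (x * x)), base


for x in (1.1, 2.0, 3.0, 5.0):
    lower, upper = tail_bounds(x)
    print(f"x={x}: {lower:.3e} <= {upper_tail(x):.3e} <= {upper:.3e}")
```

At $x = 1.1$ the factor $1 - 1/x^2$ equals about 0.1736, which is why it can be replaced with 0.17 for larger $x$.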



4.1 An Upper Bound 



With the bound given by the Central Limit Theorem, we follow the steps in our
“recipe” to instantiate the generic algorithm for frequency estimation.

Step 1. We use the bound (or, more specifically, the function $B(\alpha, \mu, t)$) derived
from the right-hand inequality of (5). For defining our $B(\alpha, \mu, t)$, we first obtain
an appropriate $x$ so that the following relation holds.

\[ [\, (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \,] \iff \left[\, \left| \frac{Y_t - t\mu}{\sigma_t} \right| \le x \,\right]. \]

Recall the following definitions:

\[ Y_t = \sum_{i=1}^{t} X_i \ \ (\text{hence } Y_t = t\mu_t), \qquad \mathrm{E}[Y_t] = t\mu, \qquad \sigma_t = \sqrt{t\mu(1-\mu)}. \]

Thus, by using the fact that $(1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu$ if and only if $|\mu_t - \mu| \le \alpha\mu$,
the above relation can be restated as

\[ [\, |\mu_t - \mu| \le \alpha\mu \,] \iff \left[\, |\mu_t - \mu| \le x \sqrt{\frac{\mu(1-\mu)}{t}} \,\right]. \]

Hence, this relation holds by letting $x = \alpha\sqrt{t\mu/(1-\mu)}$. Then, since we assume
the approximation (4), we have

\[ \Pr[\, (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \,] \approx 1 - 2\left(1 - \Phi\left(\alpha\sqrt{t\mu/(1-\mu)}\right)\right). \qquad (6) \]

Furthermore, using the bound in (5) and assuming that $x \ge 1$, we have

\[ \Pr[\, (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \,] \ge 1 - \frac{2}{\sqrt{2\pi}} \frac{\exp(-x^2/2)}{x} \ge 1 - \frac{2}{\sqrt{2\pi}} \exp(-\alpha^2 t\mu/2). \]

Thus, $B(\alpha, \mu, t) = \sqrt{2/\pi}\, \exp(-\alpha^2 t\mu/2)$ satisfies our purpose.
Step 2. Now define $\alpha_t$ so that the relation $B(\alpha_t, \mu, t) = \delta/(t(t+1))$ holds.
For this, it is enough to define

\[ \alpha_t \stackrel{\mathrm{def}}{=} c \sqrt{\frac{1}{t\mu} \left( \ln\frac{t(t+1)}{\delta} - d \right)} \]

with $c = \sqrt{2}$ and $d = \ln\sqrt{\pi/2} = 0.225\cdots$. (For simplifying our later
discussion, we ignore $d$ by using a slightly larger constant for $c$.)

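Step 3. Assume here that $\mu_t \le (1+\alpha_t)\mu$. Then we have

\[ \alpha_t = c \sqrt{\frac{1}{t\mu} \ln\frac{t(t+1)}{\delta}} \le c \sqrt{\frac{1+\alpha_t}{t\mu_t} \ln\frac{t(t+1)}{\delta}}, \]

or

\[ \frac{\alpha_t}{\sqrt{1+\alpha_t}} \le c \sqrt{\frac{1}{t\mu_t} \ln\frac{t(t+1)}{\delta}}. \]

Since $\alpha_t \le 1$ in all cases of interest, we have $\sqrt{1+\alpha_t} \le 1.41\cdots$, and we
can ignore the $\sqrt{1+\alpha_t}$ factor by taking a slightly larger constant for $c$. Then,
define $\epsilon_t$ to be the right-hand side of this inequality and we have $\alpha_t \le \epsilon_t$.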
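
As an illustration, the resulting instance for frequency estimation can be sketched in Python. The constant `c = 2.0` is an assumption made for this sketch (roughly $\sqrt{2}$ from solving for $\alpha_t$ times the bound 1.41 on $\sqrt{1+\alpha_t}$); the function names are hypothetical, and the guarantee rests on the Central Limit Theorem approximation assumed in the text.

```python
import math
from typing import Iterable, Optional, Tuple


def freq_eps(mu_t: float, t: int, delta: float, c: float = 2.0) -> float:
    """eps_t = c * sqrt((1 / (t * mu_t)) * ln(t * (t + 1) / delta))."""
    if mu_t <= 0.0:
        return math.inf  # no 1 has been observed yet, so no relative bound
    return c * math.sqrt(math.log(t * (t + 1) / delta) / (t * mu_t))


def estimate_frequency(
    bits: Iterable[int], eps: float, delta: float, c: float = 2.0
) -> Optional[Tuple[int, float]]:
    """Read 0/1 examples until eps_t <= eps; return (t, mu_t) or None."""
    ones = 0
    for t, x in enumerate(bits, start=1):
        ones += x
        mu_t = ones / t  # observed frequency
        if freq_eps(mu_t, t, delta, c) <= eps:
            return t, mu_t
    return None
```

Because $\epsilon_t$ depends on the observed $\mu_t$, the stopping time adapts to the unknown frequency: rarer events automatically require more examples.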



4.2 A Lower Bound 

Here again we assume the approximation (4) due to the Central Limit Theorem.
Then we prove that if we use our generic algorithm for frequency estimation, it
is impossible to bound $\epsilon_t$ by any function smaller than $c_1\sqrt{(1/(t\mu_t)) \ln(1/\delta)}$ for
some constant $c_1$. That is, our definition of $\epsilon_t = O(\sqrt{(1/(t\mu_t)) \ln(t(t+1)/\delta)})$
obtained above achieves almost optimal convergence speed (ignoring the difference
in the log factor). (We note here that a more general result, including this one
as a special case, was proved in [2] with a more complicated technique.)

Theorem 1. For frequency estimation, suppose that we use the generic algorithm
with the formula $\mu_t = (\sum_{i=1}^{t} X_i)/t$. Then there is some constant $c_1$
such that for any $\delta < 0.27$, any $\mu$, and any $t$, it must hold that either $\epsilon_t > 3/4$
or

\[ \epsilon_t \ge c_1 \sqrt{\frac{1}{t\mu_t} \ln\frac{1}{\delta}} \]

in order to guarantee that $\mu_t$ is an $(\epsilon_t, \delta)$-approximation of $\mu$.
Remark. 1. This lower bound result is for our generic algorithm, and it does
not rule out a better algorithm for frequency estimation. But, on
the other hand, our generic algorithm is quite general, and in fact, this bound
applies so long as the estimation $\mu_t$ is calculated as the observed frequency.

2. Note that we discuss the error probability at the $t$th step. Since it may be
possible that errors occur in earlier steps, the total error probability is larger
than our estimation. We might be able to get a better lower bound if we also
considered this point.

3. In the following, though by a very rough estimation, we prove this bound with
$c_1 = 1/4$.

Proof. Consider any algorithm obtained from the generic one for frequency
estimation. Suppose that, for a given $\delta < 0.27$, it outputs $(\mu_t, \epsilon_t)$ at the $t$th
step. We assume that $\epsilon_t \le 3/4$ (since otherwise, the theorem trivially holds).
Furthermore, we assume that $\mu_t$ is an $(\epsilon_t, \delta)$-approximation of $\mu$, or more
precisely, the following holds.

\[ \delta \ge \Pr[\, \neg[\, (1-\epsilon_t)\mu \le \mu_t \le (1+\epsilon_t)\mu \,] \,]. \]

From the approximation (6), this implies that

\[ \delta \ge 2\left(1 - \Phi\left(\epsilon_t\sqrt{t\mu/(1-\mu)}\right)\right). \]

Let $x_t = \epsilon_t\sqrt{t\mu/(1-\mu)}$.

First we note that $x_t > 1.1$. Because otherwise $2(1 - \Phi(x_t)) \ge 2(1 - \Phi(1.1))$
$= 0.27133\ldots > 0.27$, contradicting our assumption that $\delta < 0.27$. Now since
$x_t > 1.1$, by using the left-side inequality of (5), we have

\[ \delta \ge 2(1 - \Phi(x_t)) \ge \frac{2}{\sqrt{2\pi}} \frac{\exp(-x_t^2/2)}{x_t} \left(1 - \frac{1}{x_t^2}\right) \ge \frac{2 \cdot 0.17}{\sqrt{2\pi}} \frac{\exp(-x_t^2/2)}{x_t}. \]

Hence, for the constant $d = 0.34/\sqrt{2\pi} = 0.1356\ldots$, we have

\[ \frac{d}{\delta} \le x_t \cdot \exp(x_t^2/2). \]

Combining this with $x_t > 1.1$, it is then easy to see that $x_t$ must
satisfy the following bound.



Xt > 



2 




Hence, 



Xt 1 I 1 TT 1 /II 

y^tfI(T^^ 2yt^(l-^) S 2\l tfi S 

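Here we may assume that $\mu_t \ge (1-\epsilon_t)\mu$ holds. (Otherwise, $\mu_t$ clearly
does not satisfy the approximation condition.) Then since $\mu \le \mu_t/(1-\epsilon_t)$, by
using the assumption that $\epsilon_t \le 3/4$, we have

\[ \epsilon_t \ge \frac{1}{2}\sqrt{\frac{1}{t\mu} \ln\frac{1}{\delta}} \ge \frac{1}{2}\sqrt{\frac{1-\epsilon_t}{t\mu_t} \ln\frac{1}{\delta}} \ge \frac{1}{4}\sqrt{\frac{1}{t\mu_t} \ln\frac{1}{\delta}}. \]

This completes the proof. □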







4.3 Efficiency 



For given $\epsilon$ and $\delta$, let us estimate the time sufficient/necessary for obtaining
an $(\epsilon, \delta)$-approximation of frequency $\mu$ by using our generic algorithm.

First consider the algorithm we instantiated in Section 4.1 and derive an
upper bound. From the definition of $\epsilon_t$ obtained in Section 4.1 and under the
assumption that $\mu_t \ge (1-\epsilon_t)\mu$, it is easy to show that $\epsilon_t \le \epsilon$ holds when

\[ \frac{t}{\ln(t^2/\delta)} \approx \frac{t}{\ln(t(t+1)/\delta)} \ge \frac{2c^2}{\epsilon^2\mu}. \]

We observe that $x > 2z\ln(zy) > z\ln(z^2 y)$ implies $x/\ln(x^2 y) > z$; using
this approximation, we can roughly bound the time $t$ sufficient for having $\epsilon_t \le \epsilon$
as

\[ t \approx 2 \cdot \frac{2c^2}{\epsilon^2\mu} \left( \ln\frac{2c^2}{\epsilon^2\mu} + \ln\frac{1}{\delta} \right). \]

That is, with probability $\ge 1 - \delta$, we have $\epsilon_t \le \epsilon$ by the time $t$ reaches the
bound above.

Here it should be noted that with an analysis quite specific to frequency
estimation, Lipton and Naughton [15] gave an algorithm that runs in time
$O((1/(\epsilon^2\mu)) \ln(1/\delta))$, i.e., a log factor faster than this algorithm. See [23] for some details
and reasoning about this difference.

Next we derive a lower bound from Theorem 1. Consider any instantiation
of the generic algorithm, and let $(\mu_t, \epsilon_t)$ be its output at the $t$th step. The
theorem claims that, for any step $t$, it must hold that $\epsilon_t \ge c_1\sqrt{(1/(t\mu_t)) \ln(1/\delta)}$
in order to guarantee that $\mu_t$ is an $(\epsilon_t, \delta)$-approximation of $\mu$. Then clearly,
to obtain an $(\epsilon, \delta)$-approximation of $\mu$, it must hold that

\[ \epsilon \ge \epsilon_t \ge c_1\sqrt{\frac{1}{t\mu_t} \ln\frac{1}{\delta}} \ge c_1\sqrt{\frac{1}{t(1+\epsilon)\mu} \ln\frac{1}{\delta}}. \]

From this and using the trivial bound $\epsilon \le 1$, we have the following lower bound.

\[ t \ge \frac{c_1^2}{2\epsilon^2\mu} \ln\frac{1}{\delta}. \]




5 Case Study (2): Advantage Estimation 

In advantage estimation, i.e., Example 2 stated in Section 2, an algorithm has
access to a series of i.i.d. variables $X_1, X_2, \ldots$ such that each $X_i$ takes only
values 0 or 1 and $\mathrm{E}[X_i] = p$. Here we assume that $p > 1/2$. Our goal
is to estimate the advantage $\mu = p - 1/2$. The formula we use in the generic
algorithm is $\mu_t = (X_1 + \cdots + X_t)/t - 1/2$.

5.1 An Upper Bound

Step 1. Observe that we cannot apply Chernoff or (simple) Hoeffding bounds
directly because $\mu_t$ is not the average of a sum of 0/1 variables. On the other
hand, defining $p_t = (\sum_{i=1}^{t} X_i)/t$, we have

\[ (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \iff p - \alpha\mu \le p_t \le p + \alpha\mu. \]

Thus, by applying the Hoeffding bound now, we have

\[ \Pr[\, (1-\alpha)\mu \le \mu_t \le (1+\alpha)\mu \,] = \Pr[\, p - \alpha\mu \le p_t \le p + \alpha\mu \,] \ge 1 - 2\exp(-2\alpha^2\mu^2 t). \]

Hence we define $B(\alpha, \mu, t) = 2\exp(-2\alpha^2\mu^2 t)$.

Notice here that the argument from now on works for any other $\mu$ for which
a similar Hoeffding bound can be applied. This is the case considered in [5].

Step 2. To have $B(\alpha_t, \mu, t)$ ($= 2\exp(-2\alpha_t^2\mu^2 t)$) $= \delta/(t(t+1))$, we define

\[ \alpha_t \stackrel{\mathrm{def}}{=} \frac{1}{\mu}\sqrt{\frac{1}{2t} \ln\frac{2t(t+1)}{\delta}}. \]

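Step 3. Assuming that $\mu_t \le (1+\alpha_t)\mu$, we have

\[ \alpha_t \le \frac{1+\alpha_t}{\mu_t}\sqrt{\frac{1}{2t} \ln\frac{2t(t+1)}{\delta}}, \]

hence,

\[ \frac{\alpha_t}{1+\alpha_t} \le \frac{1}{\mu_t}\sqrt{\frac{1}{2t} \ln\frac{2t(t+1)}{\delta}}. \]

Now define $\beta_t$ to be the right-hand side of this inequality. Then by simple
calculations, we have (for $\beta_t < 1$)

\[ \alpha_t/(1+\alpha_t) \le \beta_t \iff \alpha_t \le \beta_t/(1-\beta_t). \]

Thus, we define $\epsilon_t \stackrel{\mathrm{def}}{=} \beta_t/(1-\beta_t)$. This gives us an instance of the generic
algorithm.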
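
The error bound for advantage estimation can be sketched in Python as follows. The formula is obtained by solving $2\exp(-2\alpha^2\mu^2 t) = \delta/(t(t+1))$ for $\alpha$ and substituting $\mu_t$ as in Step 3; the function name `advantage_eps` and the convention of returning infinity while $\beta_t \ge 1$ are assumptions of this sketch.

```python
import math


def advantage_eps(mu_t: float, t: int, delta: float) -> float:
    """Return eps_t = beta_t / (1 - beta_t) for advantage estimation.

    beta_t = (1 / mu_t) * sqrt(ln(2 * t * (t + 1) / delta) / (2 * t)).
    While beta_t >= 1 (too few examples) no finite bound is reported.
    """
    if mu_t <= 0.0:
        return math.inf
    beta_t = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t)) / mu_t
    if beta_t >= 1.0:
        return math.inf
    return beta_t / (1.0 - beta_t)
```

Since $\beta_t$ scales with $1/\mu_t$ rather than $1/\sqrt{\mu_t}$, the number of examples needed grows like $1/\mu^2$, in contrast to $1/\mu$ for frequency estimation.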



5.2 A Lower Bound 

Theorem 2. For advantage estimation, suppose that we use the generic algorithm
with the formula $\mu_t = (\sum_{i=1}^{t} X_i)/t - 1/2$. Then there is a constant $c_2$
such that for any $\delta < 0.27$, any $\mu$, and any $t$, it must hold that either $\epsilon_t > 1/4$
or

\[ \epsilon_t \ge c_2 \sqrt{\frac{1}{t\mu_t^2} \ln\frac{1}{\delta}} \]

in order to guarantee that $\mu_t$ is an $(\epsilon_t, \delta)$-approximation of $\mu$.
Proof. We work by contradiction. Assume that the theorem is false, i.e., some
instantiation of the generic algorithm for advantage estimation does better than
stated in the theorem. Then we will show that some instantiation of the generic
algorithm for frequency estimation contradicts the lower bound in Theorem 1.

In order to distinguish the pairs produced by the two instances of
the algorithm, we will use symbol $\mu$ to denote an advantage, $\mu_t$ and $\epsilon_t$ for the
output of the advantage estimator, $p$ for a frequency, and $p_t$ and $\theta_t$ for the
output of the frequency estimator.

Assume that the body of the advantage estimator computes $\mu_t$ and $\epsilon_t$ as:

    compute $\mu_t := (\sum_{i=1}^{t} X_i)/t - 1/2$;
    compute $\epsilon_t := f(S_t)$;

for some function $f$; then instantiate the frequency estimator algorithm with
the assignments:

    compute $p_t := (\sum_{i=1}^{t} X_i)/t$;
    compute $\theta_t :=$ if $f(S_t) > 1/4$ then 1 else $4 \cdot (p_t - 1/2) \cdot f(S_t)$;

Observe that, for any fixed sequence of random examples, we have $p_t = \mu_t + 1/2$,
and either $\theta_t = 1$ (if $\epsilon_t > 1/4$), or $\theta_t = 4\mu_t\epsilon_t$ (if $\epsilon_t \le 1/4$).

We show first that $p_t$ is a $(\theta_t, \delta)$-approximation
of $p$. Indeed, by assumption $\mu_t$ is an $(\epsilon_t, \delta)$-approximation of $\mu$, i.e., with
probability $1 - \delta$,

\[ (1-\epsilon_t)\mu \le \mu_t \le (1+\epsilon_t)\mu. \]

As noted at the start of Section 5.1, this is equivalent to

\[ p - \epsilon_t\mu \le p_t \le p + \epsilon_t\mu. \]

Suppose first $\epsilon_t > 1/4$. Then $\theta_t = 1$ and $(1-\theta_t)p \le p_t \le (1+\theta_t)p$ holds
because we are assuming $p > 1/2$ for this proof. Otherwise, $\epsilon_t \le 1/4$; then
$\theta_t = 4\epsilon_t\mu_t$ and also $\mu_t \ge (1-\epsilon_t)\mu \ge \mu/2$. Therefore, applying the definition of
$\theta_t$ and $p > 1/2$,

\[ (1-\theta_t)p = p - \theta_t p = p - 4\mu_t\epsilon_t \cdot p \le p - 4(\mu/2)\epsilon_t \cdot (1/2) = p - \mu\epsilon_t \le p_t \]

and, similarly, $(1+\theta_t)p \ge p_t$.

On the other hand, assume that for constant $c_2 = c_1/4$ and some $\delta < 0.27$,
$\mu$, and $t$, we have both $\epsilon_t \le 1/4$ and $\epsilon_t < c_2\sqrt{(1/(t\mu_t^2)) \ln(1/\delta)}$. Then, $\theta_t =
4\mu_t\epsilon_t \le \mu_t < 3/4$. Also, $\theta_t = 4\mu_t\epsilon_t < 4c_2\sqrt{(1/t) \ln(1/\delta)} \le 4c_2\sqrt{(1/(tp_t)) \ln(1/\delta)}$.
This contradicts Theorem 1. □



5.3 Efficiency 

Proceeding as in Section 4.3, we can show that the time $t$ sufficient/necessary
to get an $(\epsilon, \delta)$-approximation of advantage $\mu$ by our generic algorithm satisfies,
for suitable constants $c, c' > 0$,

\[ \frac{c}{\epsilon^2\mu^2} \ln\frac{1}{\delta} \le t \le \frac{c'}{\epsilon^2\mu^2} \ln\frac{1}{\epsilon^2\mu^2\delta}. \]

This is about $1/\mu$ times bigger than the bound obtained for frequency estimation.
Again, the log factor can be reduced to log log by the trick described in
Section 3.2.

This upper bound was presented in [4], and is implied also by results in [5].
Approximate Location of Relevant Variables
under the Crossover Distribution 



Peter Damaschke 
Abstract. Searching for genes involved in traits (e.g. diseases), based
on genetic data, is considered from a computational learning perspective.
This leads to the problem of learning relevant variables of functions
from data sampled from a certain class of distributions generalizing
the uniform distribution. The Fourier transform of Boolean functions
is applied to translate the problem into searching for local extrema
of certain functions of observables. We work out the combinatorial
structure of this approach and illustrate its potential use.

Keywords: Learning from samples, relevance, Boolean functions,
Fourier transform, crossover distribution, genetics, local extrema.



1 Introduction 

Gene hunting for complex traits. A hot topic in genetics is searching for 
genes that are involved in certain traits, such as genetic diseases. We may con- 
sider the candidate genes as variables and their alleles (different versions of the 
same gene) as their values. Whether or not an individual has a specific trait 
depends on the genotype, that is, on the combination of alleles of the relevant 
genes for this trait. Hence we may describe a binary trait by a 0,1-valued func- 
tion of the genotype. In general, traits are not fully determined by the genes, 
but also influenced by environment, nutrition, etc., or unknown factors.
Therefore we consider probabilistic 0,1-valued functions. Our model assumption is
that the genotype determines the probability of the considered trait, called its 
penetrance. 

Data on genotypes and corresponding traits in pedigrees have been analyzed 
for a long time by statistical methods, in order to search for relevant genes. 
Analysis becomes particularly difficult if a trait is caused by several genes (so- 
called complex traits, explicitly mentioned in [9] as a computational challenge). 
The penetrance may be an arbitrarily complicated function, due to interactions 
between various genes and proteins. We do not restrict attention to any specific 
modes of inheritance, such as single-gene dominant or recessive traits, etc. 

Relevant variables. As in the above problem domain, observable effects or 
events are often caused by certain combinations of values of r attributes among 
n candidate attributes (with r typically much smaller than n ), and one wishes 
to conclude from given examples which attributes might be the relevant ones.
This type of problem is well-known in the learning theory literature.

The situation can be modeled by an unknown function $f$ of $n$ variables,
which properly depends on only $r$ of them. The goal is to learn $f$, first of all
the set of relevant variables, from values $f(x)$ for several $x$. Depending on the
application, numerous versions of this problem arise: The learner may be able
to submit freely chosen queries (assignments) $x$ to an oracle that knows $f$, or
the queries are sampled according to some known or unknown distribution.

Here we consider learning of relevant variables of probabilistic Boolean functions
from given samples, where a sample is a multiset of arguments $x \in \{0,1\}^n$.
That means, for each $x$, $f(x)$ is a probability distribution on the image set. For
0,1-valued functions $f$, it is equivalent to say that $f(x) = 1$ with some probability
$p(x)$ if assignment $x$ is presented. We call $p(x)$ the penetrance of $f$
for $x$. We assume that, in a sample being a multiset of assignments $x$, all $f(x)$
are independent. (It will become apparent in Section 5 that this independence
assumption is justified in the intended genetics application.)



Contributions and organization of the paper. We exploit ideas from algo- 
rithmic learning theory to derive a structural framework for gene hunting based 
on parents-children data. We use the Fourier transform of penetrances p to con- 
struct guide functions from observable parameters that “point to” the relevant 
variables in a certain sense. Since the Fourier transform represents arbitrary 
functions as linear combinations of very simple ones, we finally obtain rather 
simple guide functions. So this approach is, in principle, applicable to arbitrarily 
complex traits. One should be aware that we mainly study the combinatorial 
nature of the problem and leave out the statistical aspects. In order to decouple 
the structural issues from the statistical ones, we always implicitly assume that 
our samples are large enough to get sufficiently accurate estimates of observable 
values. (However we address the sample size in case of the uniform distribution, 
cf. Theorem 1.) 

In Section 2 we recall the well-known relationship between Fourier transform 
and relevant variables of Boolean functions and give a canonical algorithm for 
learning the relevant variables under the uniform distribution. Sections 3-5 con- 
tain the new contributions, where Section 3 is the core of the paper. It addresses 
our learning problem under the more general class of crossover distributions. 
Fourier transform along with some nice properties of the crossover distribution 
allows us to build efficient search strategies. The combinatorial Theorem 5 is the 
key result. Using this theorem we can reduce our learning problem to a conceptually
simpler one: searching for local extrema of the mentioned guide functions
by function value queries (which is interesting in its own right). The idea is illustrated
in Section 4 by a concrete construction which is then fully treated in case r = 2 . 
In Section 5 we propose several ways of applying our approach to gene hunting, 
which was the original motivation. Full exploration of the approach must be left 
for future research. Especially the properties of guide functions deserve further 
study. 







Literature. Relevance is a common topic in machine learning and inference, 
see [1,8,12,17] for some recent contributions and further pointers. Learning rel- 
evant variables by queries has been intensively studied under several learning 
models, e.g. exact learning by adaptive and nonadaptive membership queries 
[4,5]. (Again, many more references can be found there.) In [11], the model we
assume here is called p-concepts; however, the goal there is different from ours.

The use of Fourier transform for learning Boolean functions is well- 
established, and the interplay between Fourier spectrum and relevance of vari- 
ables has also been observed several times [2,3,7,14,15], so it is impossible to 
summarize even the main results here. However most work is focused on the 
uniform sample distribution. 

The crossover distribution is well-known in genetics [10]. For a survey of gene 
hunting in pedigree data and a problem discussion we refer to [18]. A special 
aspect of modelling traits is addressed in [13] where the cases with two involved 
genes have been classified. 

2 Relevance and Fourier Transform

The learning problem. $\mathbb{R}$ denotes the set of real numbers. A function
$p : \{0,1\}^n \to \mathbb{R}$ is called pseudo-Boolean. If $0 \le p(x) \le 1$ for all $x \in \{0,1\}^n$,
we may interpret the $p(x)$ as probabilities. Assume that an oracle for $p$ behaves
as follows: If $x$ is submitted as a query then the oracle responds 1 and 0 with
probability $p(x)$ and $1 - p(x)$, respectively. Answers to several queries are completely
independent (including the case of repeated submission of the same $x$).
The $p(x)$ are called penetrances. Queries are independently sampled according
to some distribution on $\{0,1\}^n$. In the present section we consider the uniform
distribution: every $x$ has probability $2^{-n}$.

A variable $v$ is called relevant if there exists an assignment $x$ on the other
variables such that $p(x, 0) \neq p(x, 1)$. (Here the last component is the value of $v$,
and $x$ is the assignment on the other $n-1$ variables. To simplify our formalism,
we will often use this loose notation if the meaning is clear from the context.)
$R$ denotes the set of relevant variables (relevant set, for short) of $p$.

Imagine that a learner wants to identify $R$ (subject to some error probability)
by observing many pairs of queries $x$ and oracle answers $f(x)$. The
naive approach (estimate all $p(x)$ by frequencies of response 1 to queries $x$ and
compare them) would require a huge sample with $\gg 2^n$ queries. However, if
the learner knows in advance (or has good reasons to believe) that $|R| \le r$ for
some $r \ll n$, he can learn a more efficient representation of $p$ from a sample of
reasonable size, as we outline now.

Learning via Fourier transform. Let $T$ be any subset of variables. Parity
function $Q_T$ is defined by $Q_T(x) = (-1)^{x \cdot T}$, where $x \cdot T$ means the number
of 1s that $x$ has in $T$. It is well-known that the $Q_T$ form an orthonormal
basis in the linear space of functions $p : \{0,1\}^n \to \mathbb{R}$. The coefficients in
$p = \sum_T h_T Q_T$, given by $h_T = 2^{-n} \sum_x Q_T(x) p(x)$, are the Fourier coefficients
of $p$. We omit set symbols in subscripts and write $h$ for $h_\emptyset$, $h_v$ for $h_{\{v\}}$, etc.
The projection $p[T]$ of $p$ to $T$ is defined as follows: For any assignment $z$
on $T$, $p[T](z)$ is the average $p(x)$ over all $x$ that induce $z$ on $T$. If the $x$
are drawn from the uniform distribution, $p[T](z)$ is the probability of answer
1 under the condition that the assignment on $T$ is $z$. We may consider $p[T]$
either as a function of the variables in $T$, or as a function of all variables with
relevant variables in $T$ only; it will not cause confusion if we jump between
these views. If $T$ contains only one variable $v$, we simply write $p[v]$. Note that
$p[T] = (\sum_S h_S Q_S)[T] = \sum_S h_S Q_S[T] = \sum_{S \subseteq T} h_S Q_S$.

This equation is used to prove that $R = \bigcup_{h_T \neq 0} T$. Implication “$\subseteq$” is
evident. To see “$\supseteq$”, observe $p = p[R] = \sum_{S \subseteq R} h_S Q_S$. On the other hand,
$p = \sum_S h_S Q_S$. Since the Fourier representation is unique, we get $h_S = 0$ for
all $S \not\subseteq R$.

Hence one can learn $R$ under the assumption $|R| \le r$ as follows: Learn the
$h_T$ for all sets $T$, with $|T| = 1, 2, 3, \ldots, r$. Whenever $h_T \neq 0$, add $T$ to the
relevant set. Abort when $r$ relevant variables have been found. Hence we have
to examine $O(n^r)$ coefficients in the worst case.
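
The canonical procedure can be illustrated with a small Python sketch that computes Fourier coefficients exactly from a fully known penetrance function. This is an idealization: in the learning setting the coefficients would be estimated from samples. The function names and the tolerance `tol` are assumptions of this sketch.

```python
from itertools import combinations, product
from typing import Callable, Sequence, Set, Tuple


def fourier_coefficient(p: Callable[[Tuple[int, ...]], float], n: int, T: Sequence[int]) -> float:
    """h_T = 2^{-n} * sum over x of Q_T(x) * p(x), with Q_T(x) = (-1)^(number of 1s of x in T)."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        sign = -1 if sum(x[i] for i in T) % 2 else 1
        total += sign * p(x)
    return total / 2 ** n


def relevant_variables(p: Callable[[Tuple[int, ...]], float], n: int, r: int, tol: float = 1e-9) -> Set[int]:
    """Collect all T with 1 <= |T| <= r and h_T != 0; stop once r variables are found."""
    relevant: Set[int] = set()
    for size in range(1, r + 1):
        for T in combinations(range(n), size):
            if abs(fourier_coefficient(p, n, T)) > tol:
                relevant.update(T)
                if len(relevant) >= r:
                    return relevant
    return relevant
```

For example, with `p = lambda x: 0.9 if x[0] != x[2] else 0.1` on four variables, every singleton coefficient vanishes and only the pair coefficient for variables 0 and 2 is non-zero, so the search has to reach sets of size 2 before it finds them.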

To get estimators for the $h_T$, rewrite the above formula as

\[ h_T = \sum_{x \,|\, Q_T(x) = +1} p(x)/2^n \;-\; \sum_{x \,|\, Q_T(x) = -1} p(x)/2^n. \]

For $T \neq \emptyset$, the sums are half the conditional probabilities of answer 1 if $Q_T(x) = +1$
and $-1$, respectively. Finally replace these two conditional probabilities with
the observed frequencies. We may perform a statistical test with null hypothesis
$h_T = 0$, that is, we fix a threshold $D$ depending on sample size $N$, and
consider $h_T$ to be 0 if the estimated $|h_T|$ is below $D$. By routine calculation
using the Chernoff bound, the probability that some $h_T = 0$ is misclassified as
$h_T \neq 0$ is bounded by an arbitrarily small constant if $N = O(r \log n / D^2)$. The
probability that some relevant variable $v$ with $|p(x,0) - p(x,1)| \ge \delta$ for some $x$
remains undetected can also be made arbitrarily small, choosing $D = O(\delta/2^r)$.
Thus $N = \Theta(r 2^{2r} \log n / \delta^2)$ suffices to exclude both types of error with high
probability. It follows:

Theorem 1. Given a probabilistic Boolean function p with |i?| ^ r, those 
relevant variables whose change can affect p by more than S can be learned 
by a sample of size 6>(r2^’’logn/5^) under the uniform distribution, examining 
0{n^) Fourier coefficients of subsets of size at most r . 

We remark that the complexity of learning deterministic Boolean functions
by learner-chosen assignments is roughly $O(r 2^r \log n)$ [4]. The exponential
term in r is not off-putting, as r is small in the intended applications.
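
The estimation step with threshold D can be sketched as follows. The oracle is simulated here, which is an assumption made purely for illustration; the function names, the example function and the concrete values of N and D are hypothetical and are not tuned to the bounds of Theorem 1.

```python
import random

def parity(x, T):
    """Q_T(x): +1 for an even number of ones on T, else -1 (assumed convention)."""
    return -1 if sum(x[i] for i in T) % 2 else 1

def draw_samples(p, n, N, rng):
    """Simulated oracle: uniform assignment x, answer 1 with probability p(x)."""
    samples = []
    for _ in range(N):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        samples.append((x, 1 if rng.random() < p(x) else 0))
    return samples

def estimate_h(samples, T):
    """Empirical mean of answer * Q_T(x); under uniform x its expectation is h_T."""
    return sum(a * parity(x, T) for x, a in samples) / len(samples)

def is_nonzero(samples, T, D):
    """Threshold test: reject the null hypothesis h_T = 0 if |estimate| >= D."""
    return abs(estimate_h(samples, T)) >= D

rng = random.Random(1)
samples = draw_samples(lambda x: 0.2 + 0.6 * (x[1] ^ x[3]), 5, 20000, rng)
print(estimate_h(samples, (1, 3)), is_nonzero(samples, (0,), 0.05))
```

The mean of answer times $Q_T(x)$ is the same quantity as the difference of the two sums in the formula for $h_T$, with probabilities replaced by observed frequencies.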

Efficiency. If p has particular structural properties, the above algorithm needs
far fewer than $O(n^r)$ Fourier coefficients to learn R. We give an example:
A function p is called locally monotone if, for every variable v, either all
assignments x on the variables $\neq v$ satisfy $p(x,0) \le p(x,1)$, or all x
satisfy $p(x,0) \ge p(x,1)$.

Theorem 2. If p is locally monotone then $h_v \neq 0$ for all $v \in R$.
Consequently, if the learner knows this property in advance, he can learn R from
the $O(n)$ coefficients of singleton sets.

Proof. Let $v \in R$ and assume for contradiction that $|T| > 1$ for all
$T \ni v$ with $h_T \neq 0$. Note that $Q_\emptyset[v] = 1$ and $Q_T[v] = 0$ for
all $T \not\subseteq \{v\}$. Hence $p[v] = \sum_T h_T Q_T[v] = h_\emptyset = h$.
That is, $p[v]$ is constant: $p[v](0) = p[v](1)$. Let w.l.o.g.
$p(x,0) \le p(x,1)$ for all assignments x on the variables $\neq v$. At least one
of these inequalities is strict, due to $v \in R$. But $p[v](i)$ is the average
of the $p(x,i)$, for i = 0, 1: contradiction. □

One may ask more generally: Given an integer constant t, for which class of
functions p can we learn R by examining $O(n^t)$ Fourier coefficients (although
it might be t < r)? It is natural to consider only function classes that are
closed under permutation of variables.

Consider the following algorithm ION (“increment only if necessary”): 

Let S be the set of relevant variables found so far (initially $S = \emptyset$).
Starting with t := 1 and s := 0, repeat the following until |S| = r or t > r:
Estimate $h_T$ for all T with $|T \cap S| = s$ and $|T \setminus S| = t$. If
$h_T \neq 0$ then let $S := S \cup T$ and reset t := 1, s := 0. If S did not
grow, let s := s + 1. If already s = |S|, let t := t + 1 and reset s := 0.
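
A minimal sketch of ION, assuming exact coefficients are available through a hypothetical callback `h(T)`; with estimated coefficients the zero test would be the threshold test instead. One interpretation made here, which the text leaves open, is that all sets T with nonzero coefficient found in one round are added to S together. The function also counts the examined coefficients.

```python
from itertools import combinations

def ion(n, r, h, eps=1e-12):
    """Increment only if necessary. h maps a frozenset T to its coefficient h_T.
    Returns the set S of relevant variables found and the number of coefficients examined."""
    S, t, s, examined = set(), 1, 0, 0
    while len(S) < r and t <= r:
        inside = sorted(S)
        outside = [v for v in range(n) if v not in S]
        found = set()
        for part_in in combinations(inside, s):       # |T intersect S| = s
            for part_out in combinations(outside, t):  # |T minus S| = t
                examined += 1
                T = frozenset(part_in + part_out)
                if abs(h(T)) > eps:
                    found |= T
        if found:
            S |= found
            t, s = 1, 0
        elif s == len(S):
            t, s = t + 1, 0
        else:
            s += 1
    return S, examined

# Hypothetical coefficient tables (keys are frozensets of variable indices).
only_pair = {frozenset(): 0.5, frozenset({1, 3}): -0.3}
chained = {frozenset(): 0.5, frozenset({0}): 0.1, frozenset({0, 2}): 0.2}
print(ion(5, 2, lambda T: only_pair.get(T, 0.0)))
print(ion(5, 2, lambda T: chained.get(T, 0.0)))
```

In the second table the variable 2 is found after 13 examinations because ION extends the already known variable 0 before it increases t; a plain scan by set size would need all 5 singletons and up to 10 pairs.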

The next theorem says that ION has optimal efficiency, in a certain sense. 

Theorem 3. Let r be fixed and $t \le r$. Let A be any algorithm that learns R
provided that $|R| \le r$, and C be the class of functions where A learns R from
$O(n^t)$ coefficients $h_T$. Then ION learns R of every $p \in C$ after
examination of $O(n^t)$ coefficients, too.

Proof. Consider some p, and let u be the maximum value of t during
execution of ION if p is given. It is not hard to see that ION examines less
than $2^r n^u$ coefficients, which is $O(n^u)$ for fixed r. Hence, if ION fails
to learn R for some p in $O(n^t)$ steps, there must be some T with
$h_T \neq 0$ such that $|T \setminus S| > t$, where S denotes the final outcome
of ION. Moreover, any such set T has more than t elements outside S. Now assume
that A learns the relevant set of p in $O(n^t)$ steps. Then we can foil A, due
to a symmetry argument: Since we may apply any permutation to the variables,
particularly to the variables outside S, there are functions in C where A needs
more than $O(n^t)$ steps to guess any of these sets. □

3 Relevance and Crossover Distribution 

Linkage and crossover distribution. Given linearly ordered variables
$v_1 < \ldots < v_n$ and positive numbers $\theta_i \le 1/2$, $i = 1, \ldots, n$,
the crossover distribution is defined as follows: $v_1 = 0$ or 1 with
probability 1/2 each, and $v_{i+1} \neq v_i$ (this is called a switch) with
probability $\theta_i$, and all switches are independent. We call $\theta_i$ the
switch probability of interval $[v_i, v_{i+1}]$. It can be generalized to
arbitrary intervals: Let i < j < k. If $[v_i, v_j]$ and $[v_j, v_k]$ have switch
probabilities $\lambda$ and $\mu$, respectively, then $[v_i, v_k]$ has switch
probability $(1 - \lambda)\mu + \lambda(1 - \mu) = \lambda + \mu - 2\lambda\mu$.

The linkage L of an interval $[v_i, v_k]$ with switch probability $\theta$ is
defined to be $L := 1 - 2\theta$. Variables $v_i, v_k$ are called linked if
L > 0, and unlinked if L = 0. If all variables are unlinked, we have the uniform
distribution. Note that $\theta = \frac{1}{2}(1 - L)$ and
$1 - \theta = \frac{1}{2}(1 + L)$. Linkage is multiplicative in the following
sense: Consider i < j < k. If $[v_i, v_j]$ and $[v_j, v_k]$ have linkage L and M,
respectively, then $[v_i, v_k]$ has linkage
$1 - 2(\lambda + \mu - 2\lambda\mu) = (1 - 2\lambda)(1 - 2\mu) = LM$.
Hence $d = -\ln L$ is additive and can be interpreted as a distance, called the
linkage distance. Also note that $L = e^{-d}$. (These concepts were developed in
the early days of genetics, known under the keyword Haldane map function.) The
set of variables is uniquely partitioned into segments of pairwise linked
variables, called linkage groups. We can place the variables of a linkage group
on a finite segment of the real coordinate axis, with distances according to
their linkage distances. Therefore we sometimes refer to variables as points.
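
The composition rules for switch probabilities, linkages and linkage distances can be checked numerically, and a crossover assignment can be sampled directly from the definition. The helper names below are hypothetical, and `distance` assumes a linked interval (L > 0).

```python
import math
import random

def combine_switch(lam, mu):
    """Switch probability of [v_i, v_k] from those of [v_i, v_j] and [v_j, v_k]."""
    return lam + mu - 2 * lam * mu

def linkage(theta):
    """L = 1 - 2*theta."""
    return 1 - 2 * theta

def distance(theta):
    """Linkage distance d = -ln L; only defined for linked intervals (L > 0)."""
    return -math.log(linkage(theta))

def sample_crossover(thetas, rng):
    """One assignment: first variable uniform, then independent switches."""
    x = [rng.randint(0, 1)]
    for theta in thetas:
        switch = 1 if rng.random() < theta else 0
        x.append(x[-1] ^ switch)
    return x

lam, mu = 0.1, 0.3
print(linkage(combine_switch(lam, mu)), linkage(lam) * linkage(mu))
print(distance(combine_switch(lam, mu)), distance(lam) + distance(mu))
print(sample_crossover([0.1, 0.3, 0.5], random.Random(0)))
```

With all switch probabilities equal to 1/2 the sampler produces uniformly distributed assignments, matching the remark that unlinked variables give the uniform distribution.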



Generalizing the Fourier learning algorithm. We want to generalize the 
algorithm for learning the relevant set R to the case that assignments x come 
from a crossover distribution. In the rest of this section we derive the combi- 
natorics that will be used, in Section 4, to get fast search strategies for the 
approximate location of the relevant variables. 

Again, the learner observes p restricted to small subsets T of variables and
then puts these pieces of partial information together to infer R. Note that we
do not assume that the switch probabilities are known a priori.

Let $p_T(z)$ be the probability of oracle response 1, under the condition that
the assignment on T is z, and let $q_T(z)$ be the probability that the assignment
on T is z. (Under the uniform distribution we had $p_T = p[T]$ and
$q_T = 2^{-|T|}$.) The $p_T(z)$ and $q_T(z)$ are observable in the sense that
they can be estimated well as long as |T| is small enough, compared to $\log_2$
of the sample size.

Next we study how these observables depend on the function p . Then we can 
use the result for the inverse problem of learning R and p from the observables. 
In the sequel, y and z denote assignments on R and T, respectively. We
write $q_T(y,z)$ instead of $q_{R \cup T}(y,z)$, as R is fixed. W.l.o.g. let
$R \cap T = \emptyset$, otherwise we may add dummy variables for each element in
$R \cap T$, linked to the original variables with linkage 1.

First of all, note that $p_T(z) q_T(z) = \sum_y p(y) q_T(y,z)$, and recall
$p = \sum_S h_S Q_S$, $S \subseteq R$. This yields
$p_T(z) q_T(z) = \sum_S h_S \sum_y Q_S(y) q_T(y,z)$, where for every S, y refers
merely to the assignment induced on S rather than on R.

Let $c(S,T,z) := \sum_y Q_S(y) q_T(y,z)$. We get
$p_T(z) q_T(z) = \sum_S h_S c(S,T,z)$, hence $p_T(z) q_T(z)$ is a linear
combination of terms coming from the Fourier components of p. Intuitively,
$c(S,T,z)$ says how much of $Q_S$ is present in $p_T(z) q_T(z)$. It remains to
study the $c(S,T,z)$ for any S, T, z. Let us first look at the $q_T(y,z)$
appearing in $c(S,T,z)$.

Calculating the observables. Let s = |S|, t = |T|, and k = s + t − 1.
Consider $S \cup T$ and let $L_1, \ldots, L_k$ be the linkages of the k intervals
between neighbored variables in $S \cup T$. We loosely say that linkage $L_i$ is
incident to variable $v \in S \cup T$ if v is one of the endpoints of the i-th
interval.

If o denotes the all-0 assignment, we get right from the definition of crossover
distribution: $2^{s+t} q_T(o,o) = \prod_i (1 + L_i) = \sum_K \prod_{i \in K} L_i$,
where K runs over all subsets of $\{1, \ldots, k\}$. We call the
$L_K := \prod_{i \in K} L_i$ linkage products. This also gives $q_T(y,z)$ for
arbitrary y, z. Only some “+” in the above formula turn to “−”: If a variable v
changes from 0 to 1 then the (one or two) linkages incident to v change their
signs. Hence $L_K$ changes its sign iff K contains exactly one linkage incident
to v.

We can also write the $c(S,T,z)$ as linear combinations of linkage products:
Let us start with z = o again. Remember that
$c(S,T,o) = \sum_y Q_S(y) q_T(y,o)$, and
$q_T(o,o) = 2^{-(s+t)} \sum_K L_K$, where the K are all subsets of
$\{1, \ldots, s+t-1\}$.

If we change any $v \in S$ then $L_K$ changes its sign in $q_T(y,o)$ iff K
contains exactly one linkage incident to v. But the parity of y changes, too,
thus:

Lemma 1. If some $v \in S$ is changed then $L_K$ changes its sign in $c(S,T,o)$
iff K contains no or two linkages incident to v.

All assignments y on S are obtained from y = o by changing variables from
0 to 1. So what is the coefficient of $L_K$ in $c(S,T,o)$? If K contains, for
each $v \in S$, exactly one incident linkage then, by Lemma 1, the sign of $L_K$
never changes, thus its coefficient in $c(S,T,o)$ is $2^{-t}$. All other $L_K$
have coefficient 0. Namely, there exists $v \in S$ such that K contains no or two
linkages incident to v. Consider the pairing of all y where mates differ in the
value of v only. These mates have opposite parities, hence the sum of their
contributions is 0. We get:

Lemma 2. $c(S,T,o) = 2^{-t} \sum_K L_K$, where K runs over those subsets of
linkages which contain, for every $v \in S$, exactly one linkage incident to v.

By a couple of definitions, we can formulate this in a more concise way. 

Definition 1. A chain in the ordered set $S \cup T$ is an inclusion-maximal
subsequence of variables from S which is not interrupted by variables from T. An
inner chain is between two variables of T, otherwise we speak of an outer chain.
If $S \cup T$ starts or ends with a member of T, we also say that it starts or
ends with an empty outer chain. An empty inner chain is the space between two
neighbored members of T in the sequence. A linkage product $L_K$ is a chain
product if K consists only of linkages incident to the variables of some chain,
and satisfies the following: K contains every second linkage, and, in case of a
nonempty outer chain, K contains the outermost linkage. A big chain product
is a product of chain products, one from every chain.

Note that we have exactly t + 1 chains, that every inner/outer chain has
two/one chain product(s), with the understanding that an empty set of linkages
has product 1, and that the number of big chain products is thus exactly
$2^{t-1}$.

The definition is easier than it sounds. An example illustrates the notion: If
the elements of T and S are ordered like this: SSSTTSSSST, then we have
four chains of size 3, 0, 4, 0, and the chain products are $L_1 L_3$; 1, $L_4$;
$L_5 L_7 L_9$, $L_6 L_8$; 1.

Now Lemma 2 implies:

Theorem 4. $c(S,T,o)$ is $2^{-t}$ times the sum of all big chain products.

Similarly, we can extend the theorem to arbitrary $c(S,T,z)$, starting from
$c(S,T,o)$: If some $v \in T$ changes then the signs of linkages incident to v
change, and this holds for every fixed y. Therefore $L_K$ changes its sign in
$c(S,T,z)$ iff K contains exactly one linkage incident to v. This proves:

Theorem 5. $c(S,T,z)$ is $2^{-t}$ times the sum of all big chain products, with
sign + or −. The sign of a big chain product is +/− iff it contains, for an
even/odd number of $v \in T$ with value 1, exactly one linkage incident to v. In
particular, $c(\emptyset,T,z) = 1/2$.
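
Theorem 5 can be checked numerically on small instances. The sketch below compares the defining sum $\sum_y Q_S(y) q_T(y,z)$ with the signed sum over those linkage subsets K that contain exactly one linkage incident to every variable of S (the characterization of Lemma 2). The encoding of the ordered set as a string over 'S' and 'T', the function names, the linkage values and the sign convention $Q_S(y) = (-1)^{\text{number of ones in } y}$ are assumptions made for illustration.

```python
from itertools import product
from math import prod

def joint_prob(bits, L):
    """Crossover probability of a full assignment; L[i] links neighbours i and i+1."""
    pr = 0.5
    for i, Li in enumerate(L):
        pr *= (1 + Li) / 2 if bits[i] == bits[i + 1] else (1 - Li) / 2
    return pr

def c_bruteforce(labels, L, z):
    """c(S,T,z) = sum_y Q_S(y) q_T(y,z), straight from the definition."""
    s_pos = [i for i, a in enumerate(labels) if a == 'S']
    t_pos = [i for i, a in enumerate(labels) if a == 'T']
    total = 0.0
    for y in product((0, 1), repeat=len(s_pos)):
        bits = [0] * len(labels)
        for i, b in zip(s_pos, y):
            bits[i] = b
        for i, b in zip(t_pos, z):
            bits[i] = b
        total += (-1) ** sum(y) * joint_prob(bits, L)
    return total

def c_chain_formula(labels, L, z):
    """2^-t times the signed sum of big chain products (Theorem 5)."""
    t_pos = [i for i, a in enumerate(labels) if a == 'T']
    zval = dict(zip(t_pos, z))
    total = 0.0
    for K in product((0, 1), repeat=len(L)):
        def incident(i):
            # number of chosen linkages touching variable i
            return (K[i - 1] if i > 0 else 0) + (K[i] if i < len(L) else 0)
        if any(incident(i) != 1 for i, a in enumerate(labels) if a == 'S'):
            continue  # not a big chain product
        flips = sum(1 for i in t_pos if zval[i] == 1 and incident(i) == 1)
        total += (-1) ** flips * prod(Li for Li, k in zip(L, K) if k)
    return total / 2 ** len(t_pos)

L = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
for z in [(0, 0, 0), (0, 1, 0), (1, 0, 1)]:
    print(z, c_bruteforce("SSSTTSSSST", L, z), c_chain_formula("SSSTTSSSST", L, z))
```

The ordering SSSTTSSSST is the one from the example after Definition 1, so the subsets K that survive the filter are exactly the four big chain products listed there.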

Together with $p_T(z) q_T(z) = \sum_S h_S c(S,T,z)$, this allows us to compute
all the $p_T(z) q_T(z)$, for arbitrary constellations of T and R. Next we
demonstrate the application of Theorem 5 to our learning problem.

4 Design and Use of Guide Functions

Let one variable move. We now outline the idea of using Theorem 5 to con- 
struct functions that support efficient search for R . We concentrate on the case 
that all variables are linked. Extension to several linkage groups is straightfor- 
ward. 

The learner has to observe the $p_T(z)$ and $q_T(z)$ on several carefully chosen
sets T. In the following we describe a way of choosing test sets for approaching
the relevant variables quickly.

We fix all variables of T but one, and let this variable v “move” on the
linkage distance axis, i.e. our sets T differ in the position of v only. Let x be
the coordinate of v. We define some real-valued function g with argument x,
where g is a linear combination of terms $p_T(z) q_T(z)$ for several fixed
assignments z on T.

Remember that the $p_T(z) q_T(z)$ are in turn linear combinations of linkage
products, and every linkage is some $e^x$ or $e^{-x}$, with constant factors. It
follows that a function g constructed in this way is piecewise a linear
combination of $e^x$, $e^{-x}$, 1, where the pieces are the intervals between
variables from R. Our aim is to assemble g in such a way that its local extrema
are positions of variables from R. These local extrema can be found quickly by a
binary-search-like technique. Ideally this leads us to all relevant variables in
a linkage group, by examining a logarithmic number of observables, rather than
$O(n^r)$ under the uniform distribution.

Example of guide functions from small T. A concrete construction of a guide
function follows. We use T with three elements and place two of them at the
leftmost point (w.l.o.g. 0) and the rightmost point of the linkage group, whereas
the middle element (point x) is moving.

To calculate the observables in terms of x and p, we first consider a
component $Q_S$ of p, for a fixed $S \subseteq R$ with $h_S \neq 0$. The elements
of S form two inner chains, thus we have four big chain products. We denote them
$b_x(0,0)$, $b_x(0,1)$, $b_x(1,0)$, $b_x(1,1)$, respectively, where the first
component says whether the big chain product starts with the first linkage (1)
or not (0), and the second component says whether exactly one linkage is
incident to x (1) or not (0).

Example: If the ordering is TSSSTSSSST then $b_x(0,0) = L_2 L_4 L_5 L_7 L_9$,
$b_x(0,1) = L_2 L_4 L_6 L_8$, $b_x(1,0) = L_1 L_3 L_6 L_8$,
$b_x(1,1) = L_1 L_3 L_5 L_7 L_9$.

Theorem 5 yields
$8c(S,T,(0,0,0)) = b_x(1,0) + b_x(1,1) + b_x(0,0) + b_x(0,1)$,
and similarly,
$8c(S,T,(0,1,0)) = b_x(1,0) - b_x(1,1) + b_x(0,0) - b_x(0,1)$. Defining
$g_S(x) := 4c(S,T,(0,0,0)) - 4c(S,T,(0,1,0))$ we get
$g_S(x) = b_x(1,1) + b_x(0,1)$. Finally we define $g := \sum_S h_S g_S$. Note
that this g is in fact computable from the observables: Due to Section 3 we have:
$g = 4p_T(0,0,0) q_T(0,0,0) - 4p_T(0,1,0) q_T(0,1,0)$.

We continue studying $g_S(x) = b_x(1,1) + b_x(0,1)$ for a fixed S. Let
$x_1 < \ldots < x_s$ be the coordinates of the variables in S, and
$x_i < x < x_{i+1}$. We easily see

$g_S(x) = e^{-x_1+x_2-x_3+\ldots+x_{i-1}-x_i+x-x_{i+1}+\ldots} + e^{x_1-x_2+x_3-x_4+\ldots+x_i-x+x_{i+1}-x_{i+2}+\ldots}$

$g_S(x) = e^{-x_1+x_2-x_3+\ldots+x_i-x+x_{i+1}-x_{i+2}+\ldots} + e^{x_1-x_2+x_3-x_4+\ldots+x_{i-1}-x_i+x-x_{i+1}+\ldots}$

for odd and even i, respectively. If we remove the outer elements from T (or put
them an infinite linkage distance away) then $g_S(x)$ simplifies to $b_x(0,1)$.
Since now |T| = 1 and $q_T(0) = q_T(1) = 1/2$, we also have
$g_S = 2p_T(0) - 2p_T(1) = 4p_T(0) - 2$. Moreover, if s is even then
$b_x(0,1) = 0$ for all x. Thus we get the surprising

Proposition 1. The components $h_S Q_S$ of p with even |S| have no influence
at all on the $p_T$ for |T| = 1.

In particular, if p has nonzero Fourier coefficients for even |S| only, we need
sets T with $|T| \ge 2$ in order to learn any relevant variable. On the other
hand, we have:

Proposition 2. For any S within a linkage group, there exists a guide function
$g_S$ with |T| = 2 having its local maxima at positions of variables from S.

Namely, if we remove only the rightmost element from our original T then
we obviously get $g_S(x) = b_x(1,1)$ if s is even, and $g_S(x) = b_x(0,1)$ if s
is odd, and in both cases $g_S$ has the desired property. (Remember that, in
stark contrast to this, a nonzero $h_S$ cannot be recognized at all from sets T
smaller than S under the uniform distribution.) There remains the problem that p
has in general several Fourier components whose guide functions $g_S$ might sum
up to a constant, such that we get no local extremum at all. Intuitively this
seems to be a sporadic case, but it remains to investigate how to avoid such bad
cases safely, by clever combinations of observables. Below we illustrate this
matter in the simplest non-trivial case $r \le 2$.

Detecting two linked relevant variables. We show how the above construction
works for $r \le 2$, but it becomes apparent that the method can be extended,
with some care, to arbitrary r. We write $Q_1$ instead of $Q_{\{x_1\}}$, etc.

First let r = 1, $R = \{x_1\}$, and $p = h + h_1 Q_1$. We use |T| = 1, so
coefficient h doesn’t matter, and we simply get $g = h_1 e^{-|x - x_1|}$. Our g
has a (sharp) local extremum at $x_1$. That means, to learn R we only have to
learn the local extremum of g by querying g(x) at several points x that can be
selected by the learner. This result could be easily obtained without the
general theory from Section 3, but already case r = 2 would be cumbersome
without Fourier decomposition and linkage products, such that their benefits
become obvious now.

Let $R = \{x_1, x_2\}$ with $x_1 < x_2$, and
$p = h + h_1 Q_1 + h_2 Q_2 + h_{12} Q_{12}$. We start with |T| = 1. Due to the
preceding discussion we get
$g(x) = h_1 e^{-|x - x_1|} + h_2 e^{-|x - x_2|}$. If one of $h_1, h_2$ is 0 then
we are in the situation of case r = 1 and can find the other relevant variable
as above. Let $h_1, h_2 \neq 0$. Note that g tends to 0 if x goes to
$\pm\infty$. If $h_1, h_2$ have equal sign then the pieces of g outside interval
$[x_1, x_2]$ are strictly monotone. If $h_1, h_2$ have different signs then g
inside $[x_1, x_2]$ is strictly monotone. In both cases we easily conclude that
at least one of $x_1, x_2$ is a local extremum.

It remains case $h_1 = h_2 = 0$, so $p = h + h_{12} Q_{12}$, h > 0,
$h \ge |h_{12}|$. By Proposition 1 we have to resort to a guide function g
defined with |T| = 2. Our g(x) defined above is:

$g(x) = h e^{-x} + h_{12} e^{-x + x_1 - x_2}$   if $x < x_1$
$g(x) = h e^{-x} + h_{12} e^{-x_1 + x - x_2}$   if $x_1 < x < x_2$
$g(x) = h e^{-x} + h_{12} e^{-x_1 + x_2 - x}$   if $x_2 < x$

Since an estimator for x can be computed from the $q_T(z)$, we may multiply g
by $e^x$ to get a somewhat nicer guide function in this case:

$g(x) e^x = h + h_{12} e^{x_1 - x_2}$        if $x < x_1$
$g(x) e^x = h + h_{12} e^{-x_1 + 2x - x_2}$  if $x_1 < x < x_2$
$g(x) e^x = h + h_{12} e^{-x_1 + x_2}$       if $x_2 < x$

This function is constant outside $[x_1, x_2]$, such that $x_1, x_2$ can be
obtained quickly by binary search with help of $O(\log n)$ function values.
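
The binary search on the two plateaus can be sketched as follows, under the idealizing assumption that exact function values are available (in the application they are noisy estimates, so the equality test would need a statistical tolerance). The grid of candidate positions, the parameter values and the function names are hypothetical.

```python
import math

def guide_times_exp(x, x1, x2, h, h12):
    """g(x) * e^x for p = h + h12 * Q_12 and T = {0, x}, as in the case distinction above."""
    if x < x1:
        exponent = x1 - x2
    elif x <= x2:
        exponent = 2 * x - x1 - x2
    else:
        exponent = x2 - x1
    return h + h12 * math.exp(exponent)

def plateau_ends(G, n, tol=1e-9):
    """G(i) is the guide value at candidate index i, 0 <= i < n.
    Returns (last index of the left plateau, first index of the right plateau)."""
    left_val, right_val = G(0), G(n - 1)
    lo, hi = 0, n - 1
    while lo < hi:                      # last index still on the left plateau
        mid = (lo + hi + 1) // 2
        if abs(G(mid) - left_val) <= tol:
            lo = mid
        else:
            hi = mid - 1
    left_end = lo
    lo, hi = left_end, n - 1
    while lo < hi:                      # first index on the right plateau
        mid = (lo + hi) // 2
        if abs(G(mid) - right_val) <= tol:
            hi = mid
        else:
            lo = mid + 1
    return left_end, lo

positions = [0.1 * i for i in range(101)]   # hypothetical candidate loci
G = lambda i: guide_times_exp(positions[i], positions[30], positions[70], 0.5, 0.3)
print(plateau_ends(G, len(positions)))
```

Both searches rely on the function being constant outside $[x_1, x_2]$ and strictly monotone inside, which holds whenever $h_{12} \neq 0$.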

A property of these guide functions. In the following we review the case
|T| = 1. Our $g = \sum_S h_S g_S$ defined above satisfies
$\lim_{x \to \infty} g(x) = \lim_{x \to -\infty} g(x) = 0$, and g is piecewise a
linear combination of $e^x$, $e^{-x}$. In particular $g'' = g$ for all x where g
is differentiable, and these are all x except positions of variables in R.
Hence g is, between points of R, convex if g > 0 and concave if g < 0. Hence
every local maximum above 0 and every local minimum below 0 indicates a
position from R. Moreover, at least one such extremum exists unless the entire
g is constant. If p, and all restrictions of p where the Boolean values of some
relevant variables are fixed, have nonzero Fourier components on sets of odd
size, then this observation can be used to detect the relevant variables
successively. The given p satisfies the mentioned condition e.g. if
$h_v \neq 0$ for all single variables v, as it is true for locally monotone p
(see Theorem 2), but also in other cases: It is easy to see that the condition
is equivalent to the following: The incidence matrix of variables and sets S
with $h_S \neq 0$ has full rank over GF(2).

Learning local extrema of a real-valued function g on an ordered domain
of n points by function value queries is used to find the position of a relevant
variable, once we have constructed a suitable guide function. Basically, we may
use Golden section search (see e.g. [6,16]):

Let us call a triple of points $(x_1, x_2, x_3)$ with $x_1 < x_2 < x_3$
non-monotone if $g(x_1) \le g(x_2)$ and $g(x_2) \ge g(x_3)$, with at least one of
these inequalities being strict. If $|g(x)|$ is monotone increasing/decreasing
at the left/right end (as it is true for our g above), we immediately get a
non-monotone triple to start with: Take the two outermost points and a further
point close to the leftmost or rightmost one, with larger $|g(x)|$. Then e.g.
the following is well-known:

Theorem 6. Once we have a non-monotone triple, we can find a local maximum
by $\log_\varphi n$ queries to function values, with
$\varphi = \frac{1}{2}(\sqrt{5}+1)$, and this query number is worst-case optimal.
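
A golden-section-style search on an integer index range can be sketched as below. It maintains a non-monotone triple and probes the larger gap at roughly the golden ratio; it is meant to illustrate the idea, not to reproduce the exact optimal query count of Theorem 6, and it assumes noise-free function values. The function name and the test function are hypothetical.

```python
def local_maximum(g, a, b, c):
    """Given indices a < b < c with g(a) <= g(b) >= g(c), return (index of a local
    maximum, number of distinct function values queried)."""
    cache = {}

    def q(i):
        if i not in cache:
            cache[i] = g(i)
        return cache[i]

    ratio = (5 ** 0.5 - 1) / 2          # 1/phi, about 0.618
    while c - a > 2:
        if c - b >= b - a:              # probe inside the right gap
            step = max(1, min(c - b - 1, round((1 - ratio) * (c - b))))
            d = b + step
            if q(d) > q(b):
                a, b = b, d             # new triple (b, d, c)
            else:
                c = d                   # new triple (a, b, d)
        else:                           # probe inside the left gap
            step = max(1, min(b - a - 1, round((1 - ratio) * (b - a))))
            d = b - step
            if q(d) > q(b):
                c, b = b, d             # new triple (a, d, b)
            else:
                a = d                   # new triple (d, b, c)
    return b, len(cache)

index, queries = local_maximum(lambda i: -abs(i - 37), 0, 1, 100)
print(index, queries)
```

Every update keeps the triple non-monotone, so when the triple has shrunk to three consecutive indices the middle one is a local maximum.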

In our application, the g{x) are noisy, as they come from estimated proba- 
bilities. Hence also the comparison results become more and more random when 
we approach a local extremum. The accuracy depends on the sample size. But at 
least we can greatly restrict the intervals where relevant variables can be found 
(by direct inspection) and thus quickly exclude the vast majority of candidates. 

5 Application to Genetics 

Trivial application. First consider a fixed pair of “parents” with many “chil- 
dren”. (Since large sample sizes are required, this first approach is not suitable 
in human genetics, but e.g. for plants.) Chromosomes come in pairs with ho- 
mologous genes at the same loci, that is, every individual has two versions of 
each chromosome. In the parental genotypes, name the chromosomes in each 
pair arbitrarily by 0 and 1. For each gene, a “child” inherits the allele from 
chromosome 0 or 1 from each “parent” . This immediately defines a randomized 
Boolean function p for the offspring of this pair, with two variables for each 
gene. Note that the random genotypes of “children” are independent. If a gene 
is not involved in the trait then both variables are irrelevant. (However, not
every gene that is relevant for the trait at all needs to show up in a
particular “family”.) Linkage groups correspond to chromosomes, hence, if the
candidate genes for the trait are on mutually different chromosomes then the
variables are unlinked, such that assignments x are uniformly distributed in the
“sibship”.

Affected siblings. Consider families with two (or more) children and define
another function as follows. As above, introduce a Boolean variable for each
gene and parent, but now assign value 0/1 to a variable if the siblings inherited
equal/different alleles. (In case 0 the siblings’ alleles are called “identical
by descent” in the genetics literature.) Note that assignments of several
disjoint pairs of siblings are independent. Define $p^{(2)}$ to be the
probability that both siblings have the trait.

We still restrict attention to the case that the candidate genes are unlinked.
Then the assignments defined above follow the uniform distribution, too. Let
$R^{(2)}$ be the relevant set of $p^{(2)}$. The theorem below ensures that we do
not miss relevant variables when using $p^{(2)}$ instead of p:

Theorem 7. $R^{(2)} = R$.

Proof. Inclusion “$\subseteq$” is evident. Let $v \in R$.

Let $\oplus$ denote the componentwise XOR of two vectors. We have:

$p^{(2)}(z) = 2^{-n} \sum_{x \oplus y = z} p(x) p(y) = 2^{-n} \sum_x p(x) p(x \oplus z)$.

In the following, the last component of arguments refers to v, whereas the first
component is the assignment on all other variables. Then we can write the above
formula for the all-0 assignment as
$p^{(2)}(o,0) = 2^{-n} \sum_x (p(x,0)^2 + p(x,1)^2)$ and
$p^{(2)}(o,1) = 2^{-n} \sum_x 2p(x,0)p(x,1)$. For every x we have
$p(x,0)^2 + p(x,1)^2 \ge 2p(x,0)p(x,1)$, with strict inequality iff
$p(x,0) \neq p(x,1)$. Hence $p^{(2)}(o,0) > p^{(2)}(o,1)$, which proves
$v \in R^{(2)}$. □
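
Theorem 7 can be illustrated numerically for a small, hypothetical p under the uniform (unlinked) case. The sketch builds the sibling function from the XOR formula and determines relevant variables by brute force; the function names and the example p are assumptions made for illustration.

```python
from itertools import product

def sibling_function(p, n):
    """Return p2 with p2(z) = 2^-n * sum_x p(x) * p(x XOR z)."""
    points = list(product((0, 1), repeat=n))

    def p2(z):
        total = 0.0
        for x in points:
            y = tuple(a ^ b for a, b in zip(x, z))
            total += p(x) * p(y)
        return total / 2 ** n

    return p2

def relevant_variables(f, n, eps=1e-12):
    """Variables whose flip changes f at some assignment (brute force)."""
    R = set()
    for x in product((0, 1), repeat=n):
        for v in range(n):
            y = x[:v] + (1 - x[v],) + x[v + 1:]
            if abs(f(x) - f(y)) > eps:
                R.add(v)
    return R

p = lambda x: 0.1 + 0.5 * x[0] * x[2]      # hypothetical trait probability
p2 = sibling_function(p, 3)
print(relevant_variables(p, 3), relevant_variables(p2, 3))
print(p2((0, 0, 0)), p2((1, 0, 0)))
```

The second printed line shows the strict inequality used in the proof for variable 0: sharing the allele gives a larger probability that both siblings are affected.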



The last inequality in the proof carries over to any mixture (positive linear
combination) of several functions $p^{(2)}$. Hence any relevant gene leads to
some relevant variable, even if our sample is from many families with two
children, as in human genetics. (Function p from above does not share this
property.)

In the case of rare traits it is suitable to restrict a sample to families with
affected first child (“censoring”). Define assignments z by shared alleles as
above, and let $p^{(2|1)}$ be the probability that the second child also has the
trait (“siblings relative risk”). Since

$p^{(2|1)}(z) = \sum_x p(x) p(x \oplus z) / \sum_x p(x) = p^{(2)}(z) / (2^{-n} \sum_x p(x))$,

this function has the same nice property as $p^{(2)}$.



Putting things together: Crossover distribution. Finally we come to the 
general case that several candidate genes may reside on the same chromosome 
as well. We briefly recall some basic facts about inheritance. In the process
of meiosis, a cell with only one set of chromosomes (gamete) is produced from
a normal cell with a double set of chromosomes, and then both parents transmit
these gametes, which fuse into a complete cell of a new individual. Each
chromosome created during meiosis consists alternatingly of segments of both
chromosomes of the parent. Every switch from one chromosome to the other one
is called a crossover, and crossovers at different points are independent, such
that the assignments x defined above follow a crossover distribution, and they
are independent for different individuals.

The assignments z from the definition of $p^{(2)}$ also follow the crossover
distribution. Furthermore
$p^{(2)}_T(o) q_T(o) = \sum_z (p_T(z) q_T(z))^2$. This yields another
possibility of constructing guide functions with similar properties as in
Section 4, which are even applicable to affected siblings from many families.
For example, if we take |T| = 1 then, due to Theorem 5, $p^{(2)}_T(o) q_T(o)$
has the form $(c + u(x))^2 + (c - u(x))^2$, where c and u(x) are the terms from
the even and odd Fourier components, respectively. This equals
$2(u(x)^2 + c^2)$, and we have $\lim_{x \to \pm\infty} u(x) = 0$. Moreover
$u'' = u$ yields $(u^2)'' = (2uu')' = 2(u'^2 + u^2) > 0$. Hence our function is
convex between points of R, such that local maxima indicate relevant variables.